EOFError at connection.py #12

Closed
yanghaoxiang7 opened this issue Jan 9, 2022 · 5 comments
Comments

@yanghaoxiang7

@alexfrom0815

During training I ran into the following error:

...
    (critic): Sequential(
      (0): Conv2d(64, 4, kernel_size=(1, 1), stride=(1, 1))
      (1): ReLU()
      (2): Flatten()
      (3): Linear(in_features=400, out_features=256, bias=True)
      (4): ReLU()
    )
    (critic_linear): Linear(in_features=256, out_features=1, bias=True)
  )
  (dist): Categorical(
    (linear): Linear(in_features=256, out_features=100, bias=True)
  )
)
Rotation: False
Process ForkProcess-1:
Traceback (most recent call last):
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/HDD_4T/Reps/baselines/baselines/common/vec_env/shmem_vec_env.py", line 123, in _subproc_worker
    cmd, data = pipe.recv()
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

By adding print statements, I traced a segmentation fault to this part of the code:

(at kfac.py)
      if self.steps % self.Tf == 0:
          # My asynchronous implementation exists, I will add it later.
          # Experimenting with different ways to this in PyTorch.

          self.d_g[m], self.Q_g[m] = torch.symeig(
              self.m_gg[m], eigenvectors=True)
          self.d_a[m], self.Q_a[m] = torch.symeig(
              self.m_aa[m], eigenvectors=True)

          self.d_a[m].mul_((self.d_a[m] > 1e-6).float())
          self.d_g[m].mul_((self.d_g[m] > 1e-6).float())

I guess the problem is in torch.symeig, since I found several issues about it. Unlike in those reports, though, my run stops during the first episode rather than after several hours of training. Is there a solution to this problem? Many thanks!
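
A note on the traceback above: the EOFError is raised in the worker's pipe.recv(), and Python's multiprocessing raises it when the other end of the pipe has been closed with nothing left to receive. That is consistent with the main process dying (for example from a segmentation fault) while the worker is still waiting for a command. The following is a minimal, stdlib-only sketch of that symptom; it is not code from baselines or this repository, and the function name is made up for illustration.

```python
from multiprocessing import Pipe


def recv_after_peer_closed():
    """Return the name of the exception recv() raises once the peer end is closed."""
    main_end, worker_end = Pipe()
    # Closing this end stands in for the main process dying unexpectedly.
    main_end.close()
    try:
        worker_end.recv()
    except EOFError as exc:
        return type(exc).__name__
    finally:
        worker_end.close()
    return "no error"


if __name__ == "__main__":
    print(recv_after_peer_closed())
```

So the EOFError in the worker is most likely a side effect, and the crash to look for is in the main process.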

@yanghaoxiang7
Author

BTW, I can run the training code with A2C and the testing code.

@yanghaoxiang7
Author

I see that one possible fix is to add to the "mask value", but I couldn't find it in config.py.

@yanghaoxiang7
Author

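Bug fixed.
The problem is in acktr/algo/kfac.py.
I don't know why, but in my setup torch.symeig only works on the CPU; running it on the GPU causes a segmentation fault.
Solution: move the matrices to the CPU for the decomposition, then move the results back to the GPU: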

                self.d_g[m], self.Q_g[m] = torch.symeig(
                    self.m_gg[m].cpu(), eigenvectors=True)
                self.d_g[m], self.Q_g[m] = self.d_g[m].cuda(), self.Q_g[m].cuda()
                self.d_a[m], self.Q_a[m] = torch.symeig(
                    self.m_aa[m].cpu(), eigenvectors=True)
                self.d_a[m], self.Q_a[m] = self.d_a[m].cuda(), self.Q_a[m].cuda()

I'm using torch 1.7.1 + CUDA 11. Not sure why this happens.
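
The same workaround can be wrapped in a small helper so that both call sites share it. This is only a sketch: the helper name symmetric_eig_on_cpu is made up, and it assumes the inputs are symmetric matrices, as m_gg and m_aa are treated in the snippet above. It prefers torch.linalg.eigh where that function exists (newer PyTorch releases deprecate torch.symeig in its favor) and falls back to torch.symeig otherwise.

```python
import torch


def symmetric_eig_on_cpu(matrix):
    """Eigendecompose a symmetric matrix on the CPU.

    The eigenvalues and eigenvectors are returned on the device
    the input tensor lives on.
    """
    device = matrix.device
    cpu_matrix = matrix.cpu()
    if hasattr(torch, "linalg") and hasattr(torch.linalg, "eigh"):
        # Available in newer PyTorch releases.
        eigenvalues, eigenvectors = torch.linalg.eigh(cpu_matrix)
    else:
        # Older releases such as the 1.7.1 used in this thread.
        eigenvalues, eigenvectors = torch.symeig(cpu_matrix, eigenvectors=True)
    return eigenvalues.to(device), eigenvectors.to(device)


if __name__ == "__main__":
    example = torch.tensor([[2.0, 1.0], [1.0, 2.0]])
    values, vectors = symmetric_eig_on_cpu(example)
    print(values)
```

In kfac.py the two calls would then read self.d_g[m], self.Q_g[m] = symmetric_eig_on_cpu(self.m_gg[m]) and likewise for m_aa. Because the result is moved back to the input's device, it also works when training on the CPU, unlike the hard-coded .cuda() calls.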

@suoyike1

suoyike1 commented Jul 8, 2024

BTW, I can run the training code with A2C and the testing code.

How do I train this model with A2C?
When I run the training code with A2C, I get the following error:
Traceback (most recent call last):
  File "main.py", line 233, in <module>
    main(args)
  File "main.py", line 24, in main
    train_model(args)
  File "main.py", line 99, in train_model
    args.lr,
AttributeError: 'Namespace' object has no attribute 'lr'

@yanghaoxiang7
Author


Your error indicates that "args" has no attribute "lr". "lr" is the learning rate and is typically passed in through the command-line arguments ("args"). Check that you are running the code as described in the authors' instructions; you can also add print("args:", args) to see which arguments were actually parsed. Hope this helps.
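
To illustrate the point about "args", here is a minimal sketch using argparse. The parser below is hypothetical (the option names and the default value are assumptions, not the repository's actual arguments); it only shows why the AttributeError appears when no --lr option is defined, and how defining the option with a default avoids it.

```python
import argparse


def build_parser():
    """Build a hypothetical parser; the options and defaults are illustrative only."""
    parser = argparse.ArgumentParser(description="hypothetical training script")
    parser.add_argument("--algorithm", default="a2c")
    # Defining --lr with a default guarantees that args.lr always exists.
    parser.add_argument("--lr", type=float, default=7e-4, help="learning rate")
    return parser


def missing_attr_message():
    """Reproduce the error: a Namespace that was never given an 'lr' attribute."""
    args = argparse.Namespace(algorithm="a2c")
    try:
        return args.lr
    except AttributeError as exc:
        return str(exc)


args = build_parser().parse_args(["--algorithm", "a2c"])

if __name__ == "__main__":
    print("args:", args)
    print(missing_attr_message())
```

If the repository's own parser does not define a learning-rate option for the chosen algorithm, either pass it the way the authors document or read it defensively with getattr(args, "lr", some_default).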
