Skip to content

Conversation

@ervteng
Copy link
Contributor

@ervteng ervteng commented Mar 16, 2021

Proposed change(s)

Move POCA critic to default device. Was failing Yamato GPU test without this. Needs to be cherry-picked into R15.

Types of change(s)

  • Bug fix
  • New feature
  • Code refactor
  • Breaking change
  • Documentation update
  • Other (please describe)

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)
  • Updated the documentation (if applicable)
  • Updated the migration guide (if applicable)

Other comments

@ervteng ervteng requested a review from dongruoping March 16, 2021 15:57
Copy link
Contributor

@vincentpierre vincentpierre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you manage to train on GPU after this fix ?

@ervteng
Copy link
Contributor Author

ervteng commented Mar 16, 2021

Do you manage to train on GPU after this fix ?

It's still failing: (Yamato job here: https://yamato.cds.internal.unity3d.com/jobs/497-ml-agents/tree/develop-poca-gpu/.yamato%252Fpytest-gpu.yml%2523pytest_gpu/5821068/job) need to debug

@dongruoping
Copy link
Contributor

One thing I noticed in networks.py:line 271
attn_mask = only_first_obs_flat.isnan().type(torch.FloatTensor)
this will move tensors on cpu.

Since we're using default tensor type we should avoid calling torch.FloatTensor and use .float() instead.
I've fixed several once before so I figured I should explicitly call this out.

@dongruoping
Copy link
Contributor

Do you think we should enable gpu tests in all python PRs? I think we've seen couple times this fails in nightly CI and would be nice to catch them earlier.

@ervteng
Copy link
Contributor Author

ervteng commented Mar 16, 2021

@dongruoping yep just caught that, the fix works 👍

@ervteng
Copy link
Contributor Author

ervteng commented Mar 16, 2021

Do you think we should enable gpu tests in all python PRs? I think we've seen couple times this fails in nightly CI and would be nice to catch them earlier.

I agree. Spoke with @surfnerd about this; I've added it to the Release Checklist. I guess we have to weight the cost of spinning up a GPU machine vs. catching these earlier. Not sure what the tradeoff is.

@dongruoping
Copy link
Contributor

@dongruoping yep just caught that, the fix works 👍

Can you also fix the torch.BoolTensor in the same file?

@ervteng ervteng merged commit f6a640f into main Mar 16, 2021
@delete-merged-branch delete-merged-branch bot deleted the develop-poca-gpu branch March 16, 2021 21:07
ervteng pushed a commit that referenced this pull request Mar 16, 2021
* Move critic to default device

* Make sure to clone onto default device

* Add some debug stuff

* Some more debug

* Fix issue

* Fix bool tensor too
ervteng pushed a commit that referenced this pull request Mar 16, 2021
* Move critic to default device

* Make sure to clone onto default device

* Add some debug stuff

* Some more debug

* Fix issue

* Fix bool tensor too
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants