Add torch_utils class, auto-detect CUDA availability #4403

Merged
ervteng merged 15 commits into master from develop-torch_utils on Aug 28, 2020
Conversation

@ervteng (Contributor) commented Aug 21, 2020

Proposed change(s)

This PR adds a torch_utils class (similar to tf_utils) and requires importing torch from there. This lets us do a couple of things, the first being detecting whether CUDA is available and setting the default tensor type appropriately. Requiring that torch be imported from here ensures this is set before any torch functions are used at all. The PR also adds an is_available() method to torch_utils and throws a nicer error if torch isn't available when --torch is passed.

In the future we can also set the number of torch threads, etc. from here.
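As a rough sketch of the idea described above (illustrative only, not the PR's actual code; the function names here are assumptions), such a module might look like:

```python
# Illustrative sketch of a torch_utils-style module; not the actual PR code.
# Importing torch through this module guarantees the default tensor type is
# configured before any other torch code runs.
try:
    import torch
except ImportError:
    torch = None


def is_available() -> bool:
    """Report whether torch could be imported at all."""
    return torch is not None


def default_device_is_cuda() -> bool:
    """Report whether CUDA tensors would be used by default."""
    return torch is not None and torch.cuda.is_available()


if torch is not None and torch.cuda.is_available():
    # Runs once at import time, so every tensor created later defaults to GPU.
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
```

Callers would then write `from mlagents.torch_utils import torch` instead of `import torch`, and the `--torch` CLI path could check `is_available()` to raise a friendly error.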

Types of change(s)

  • Bug fix
  • New feature
  • Code refactor
  • Breaking change
  • Documentation update
  • Other (please describe)

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)
  • Updated the documentation (if applicable)
  • Updated the migration guide (if applicable)

Other comments

ervteng changed the title from "[feature] Add torch_utils class, auto-detect CUDA availability" to "Add torch_utils class, auto-detect CUDA availability" on Aug 21, 2020
# Add files or directories to the ignore list. They should be base names, not
# paths.
ignore=CVS
generated-members=torch.*
@ervteng (Contributor, Author):

I'm pretty sure this isn't the best way to do this, but without it, pylint complains everywhere that torch doesn't have the right members. I believe this is because torch could be None, even though that never actually happens.

setup.cfg Outdated

banned-modules = tensorflow = use mlagents.tf_utils instead (it handles tf2 compat).
logging = use mlagents_envs.logging_util instead
torch = use mlagents.torch_utils istead (handles GPU detection).
Suggested change
torch = use mlagents.torch_utils istead (handles GPU detection).
torch = use mlagents.torch_utils instead (handles GPU detection).

@vincentpierre (Contributor) left a comment
Looks good to me, but I would like ml-agents/mlagents/torch_utils/torch.py to contain a comment stating that the file is temporary and will be removed once torch is required. (To avoid adding functionality to this file in the meantime.)

@ervteng (Contributor, Author) commented Aug 21, 2020

Looks good to me, but I would like ml-agents/mlagents/torch_utils/torch.py to contain a comment stating that the file is temporary and will be removed once torch is required. (To avoid adding functionality to this file in the meantime.)

The file isn't temporary (it will still be needed for GPU detection), but yeah, the try/except is. Added a comment.

@dongruoping (Contributor):

Only tensors are created on the GPU; the models are still on the CPU.
Should it also be addressed in this PR?
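This is because setting the default tensor type only affects newly created tensors; a module's parameters still have to be moved explicitly. A minimal illustration of the point (the helper name is hypothetical, not from the PR):

```python
# Hypothetical helper illustrating the reviewer's point; not PR code.
try:
    import torch
except ImportError:  # torch is optional in this codebase
    torch = None


def move_to_available_device(model):
    """Move a model's parameters to the GPU when CUDA is available."""
    if torch is None:
        raise RuntimeError("torch is not installed")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # nn.Module.to() moves parameters and buffers in place.
    return model.to(device)
```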

@dongruoping (Contributor):

Another point missing here: when running Torch with cuDNN, if we want completely reproducible results, we also need to set these, in addition to seeding torch and numpy:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

though these settings can hurt runtime performance.
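These flags would fit naturally alongside the existing seed handling; a hedged sketch of how they might be combined (the helper name is made up, not from the PR):

```python
# Illustrative seeding helper; the function name is an assumption, not PR code.
import random

import numpy as np

try:
    import torch
except ImportError:
    torch = None


def set_global_seeds(seed: int) -> None:
    """Seed Python, NumPy, and (if present) torch RNGs for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        # Force reproducible cuDNN kernels; may slow training, per PyTorch docs.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
```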

@ervteng (Contributor, Author) commented Aug 26, 2020

Another point missing here: when running Torch with cuDNN, if we want completely reproducible results, we also need to set these, in addition to seeding torch and numpy:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

though these settings can hurt runtime performance.

How much of a performance hit do we expect by setting these flags?

@dongruoping (Contributor) commented Aug 26, 2020

How much of a performance hit do we expect by setting these flags?

Not sure; it could vary case by case, depending on how much the model changes during training and how much optimization cuDNN does.
I haven't really tried to quantify the impact, but it's a general warning in the PyTorch docs.

ervteng requested a review from dongruoping on August 27, 2020 18:28
@dongruoping (Contributor) commented Aug 27, 2020

ml-agents/mlagents/trainers/policy/torch_policy.py: line 61 should be removed. It resets the default tensor type to non-CUDA, which is already properly set in utils.

I tested with removing that line. It works well.

ervteng merged commit 084d1c8 into master on Aug 28, 2020
The delete-merged-branch bot deleted the develop-torch_utils branch on August 28, 2020 00:36
The github-actions bot locked the conversation as resolved and limited it to collaborators on Aug 28, 2021