Skip to content

Conversation

@dongruoping
Copy link
Contributor

Proposed change(s)

torch.onnx.export() can only be used by one thread at a time.
When training multiple behavior with threading=True, multiple thread could be trying to export model at the same time and resulted in exception.
Adding a global lock for model serialization solves the problem.

Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)

Types of change(s)

  • Bug fix
  • New feature
  • Code refactor
  • Breaking change
  • Documentation update
  • Other (please describe)

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)
  • Updated the documentation (if applicable)
  • Updated the migration guide (if applicable)

Other comments

@vincentpierre
Copy link
Contributor

I think this works, but I would prefer the lock be part of the exporting_to_onnx object if possible.

@chriselion
Copy link
Contributor

I would prefer the lock be part of the exporting_to_onnx

exporting_to_onnx currently gets created locally, so you'll need to refactor some things so that the class gets shared (otherwise each thread will use its own lock, so nothing different actually happens)

@vincentpierre
Copy link
Contributor

exporting_to_onnx currently gets created locally, so you'll need to refactor some things so that the class gets shared (otherwise each thread will use its own lock, so nothing different actually happens)

I was thinking a static lock on the class. Something like :

class exporting_to_onnx:
    _local_data = threading.local()
    _lock = threading.Lock()
    _local_data._is_exporting = False

    def __enter__(self):
        self._lock.__enter__()
        self._local_data._is_exporting = True

    def __exit__(self, *args):
        self._lock.__exit__(*args)
        self._local_data._is_exporting = False

    @staticmethod
    def is_exporting():
        if not hasattr(exporting_to_onnx._local_data, "_is_exporting"):
            return False
        return exporting_to_onnx._local_data._is_exporting

Since it is a static variable, it should be created only once and be shared on all threads no?

@dongruoping
Copy link
Contributor Author

Since it is a static variable, it should be created only once and be shared on all threads no?

I was thinking about the same thing and I just tried it now. It seems to work.

@dongruoping
Copy link
Contributor Author

Tested with match3 with checkpoint_interval=100. Without the fix It usually fail within 1000 steps without the fix.
By adding the lock It is able to run without error for more than 20000 steps so I assume it's working.

Co-authored-by: Vincent-Pierre BERGES <vincentpierre@unity3d.com>
@dongruoping dongruoping merged commit 3e48be4 into master Nov 18, 2020
dongruoping pushed a commit that referenced this pull request Nov 18, 2020
dongruoping pushed a commit that referenced this pull request Nov 18, 2020
@dongruoping dongruoping deleted the develop-threadsafe-export branch December 9, 2020 22:58
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants