Skip to content

Conversation

@chriselion
Copy link
Contributor

@chriselion chriselion commented Dec 6, 2019

In #2988, that was raising with this callstack:

Traceback (most recent call last):
  File "/Users/chris.elion/code/ml-agents/venv/bin/mlagents-learn", line 11, in <module>
    load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')()
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 422, in main
    run_training(0, run_seed, options, Queue())
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/learn.py", line 266, in run_training
    tc.start_learning(env)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 209, in start_learning
    n_steps = self.advance(env_manager)
  File "/Users/chris.elion/code/ml-agents/ml-agents-envs/mlagents/envs/timers.py", line 262, in wrapped
    return func(*args, **kwargs)
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 283, in advance
    step_info.brain_name_to_action_info[brain_name].outputs,
  File "/Users/chris.elion/code/ml-agents/ml-agents/mlagents/trainers/rl_trainer.py", line 131, in add_experiences
    curr_to_use, take_action_outputs["action"], next_info
TypeError: 'NoneType' object is not subscriptable

The actual error isn't that important, but the thing to note is that the worker process stuck was waiting for the main process :

cmd: EnvironmentCommand = parent_conn.recv()

(I think, may have been somewhere different).
This means that the process never exited and basically deadlocked.

This change tries to call EnvManager.close() even if training raised an exception. I confirmed by undoing the fix from #2988 that this at least lets python exit and Unity stops playing (instead of having a spinning beach ball).

@chriselion chriselion requested a review from harperj December 6, 2019 02:12
assert tc.advance.call_count == 11
tc._export_graph.assert_not_called()
tc._save_model.assert_not_called()
env_mock.close.assert_called_once()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

close() isn't being called here anymore. I think it's cleaner to have it in the trainer_controller since that's what creates it in the first place.

@chriselion chriselion requested a review from ervteng December 6, 2019 02:16
Copy link
Contributor

@harperj harperj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks for this fix, looks good!

@chriselion chriselion merged commit 40d74dc into master Dec 6, 2019
@delete-merged-branch delete-merged-branch bot deleted the develop-MLA-413-dont-deadlock branch December 6, 2019 04:26
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants