
Couldn't start socket communication because worker number 0 is still in use. #1505

Closed
taylerallen6 opened this issue Dec 18, 2018 · 27 comments
Labels: bug (Issue describes a potential bug in ml-agents.)

@taylerallen6

taylerallen6 commented Dec 18, 2018

I am using Ubuntu 16.04. While going through getting-started.ipynb, I can run the script once and it works just fine, but the second time I try to run it I get this error:

Python version:
3.6.7 (default, Oct 21 2018, 04:56:05) 
[GCC 5.4.0 20160609]
Traceback (most recent call last):
  File "/home/taylerallen6/Documents/Unity_ml_testing1/ml_python3-6_test1/src/ml-agents-master/ml-agents/mlagents/envs/rpc_communicator.py", line 68, in check_port
    s.bind(("localhost", port))
OSError: [Errno 98] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test1.py", line 20, in <module>
    env = UnityEnvironment(file_name=env_name, worker_id=0, seed=1)
  File "/home/taylerallen6/Documents/Unity_ml_testing1/ml_python3-6_test1/src/ml-agents-master/ml-agents/mlagents/envs/environment.py", line 49, in __init__
    self.communicator = self.get_communicator(worker_id, base_port)
  File "/home/taylerallen6/Documents/Unity_ml_testing1/ml_python3-6_test1/src/ml-agents-master/ml-agents/mlagents/envs/environment.py", line 212, in get_communicator
    return RpcCommunicator(worker_id, base_port)
  File "/home/taylerallen6/Documents/Unity_ml_testing1/ml_python3-6_test1/src/ml-agents-master/ml-agents/mlagents/envs/rpc_communicator.py", line 43, in __init__
    self.create_server()
  File "/home/taylerallen6/Documents/Unity_ml_testing1/ml_python3-6_test1/src/ml-agents-master/ml-agents/mlagents/envs/rpc_communicator.py", line 49, in create_server
    self.check_port(self.port)
  File "/home/taylerallen6/Documents/Unity_ml_testing1/ml_python3-6_test1/src/ml-agents-master/ml-agents/mlagents/envs/rpc_communicator.py", line 70, in check_port
    raise UnityWorkerInUseException(self.worker_id)
mlagents.envs.exception.UnityWorkerInUseException: Couldn't start socket communication because worker number 0 is still in use. You may need to manually close a previously opened environment or use a different worker number.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/taylerallen6/Documents/Unity_ml_testing1/ml_python3-6_test1/src/ml-agents-master/ml-agents/mlagents/envs/environment.py", line 423, in _close
    self.communicator.close()
AttributeError: 'UnityEnvironment' object has no attribute 'communicator'
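(The second traceback is a separate symptom: `__init__` raised before `self.communicator` was assigned, so the `atexit`-registered close handler finds no such attribute. A minimal sketch of that failure mode, using a hypothetical `FakeEnv` standing in for `UnityEnvironment`:)

```python
import atexit

class FakeEnv:
    """Hypothetical stand-in for UnityEnvironment's startup/teardown order."""

    def __init__(self, fail=False):
        atexit.register(self._close)             # registered before the socket exists
        if fail:
            raise OSError("Address already in use")  # simulated failed bind
        self.communicator = object()             # only set if startup succeeded

    def _close(self):
        # Guarding with hasattr avoids the secondary AttributeError when
        # __init__ died before communicator was ever created.
        if hasattr(self, "communicator"):
            del self.communicator                # stand-in for communicator.close()
```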

It's like the env.close() isn't actually closing it.
Here is the script I am running:

env_name = "../envs/3dball1"  # Name of the Unity environment binary to launch
train_mode = True  # Whether to run the environment in training or inference mode

import matplotlib.pyplot as plt
import numpy as np
import sys

from mlagents.envs import UnityEnvironment

# %matplotlib inline

print("Python version:")
print(sys.version)

# check Python version
if (sys.version_info[0] < 3):
    raise Exception("ERROR: ML-Agents Toolkit (v0.3 onwards) requires Python 3")


env = UnityEnvironment(file_name=env_name, worker_id=0, seed=1)

# Set the default brain to work with
default_brain = env.brain_names[0]
brain = env.brains[default_brain]


# Reset the environment
env_info = env.reset(train_mode=train_mode)[default_brain]

# Examine the state space for the default brain
print("Agent state looks like: \n{}".format(env_info.vector_observations[0]))

# Examine the observation space for the default brain
for observation in env_info.visual_observations:
    print("Agent observations look like:")
    if observation.shape[3] == 3:
        plt.imshow(observation[0,:,:,:])
    else:
        plt.imshow(observation[0,:,:,0])


for episode in range(10):
    env_info = env.reset(train_mode=train_mode)[default_brain]
    done = False
    episode_rewards = 0
    while not done:
        action_size = brain.vector_action_space_size
        if brain.vector_action_space_type == 'continuous':
            env_info = env.step(np.random.randn(len(env_info.agents), action_size[0]))[default_brain]
        else:
            action = np.column_stack([np.random.randint(0, action_size[i], size=(len(env_info.agents))) for i in range(len(action_size))])
            env_info = env.step(action)[default_brain]
        episode_rewards += env_info.rewards[0]
        done = env_info.local_done[0]
    print("Total reward this episode: {}".format(episode_rewards))


env.close()

../envs/3dball1 is of course the executable for my platform. It's just the 3DBall scene with a 3DBallLearning brain. No changes. Any ideas?

@vincentpierre vincentpierre added the needs-info Issue contains insufficient information to be resolved. label Dec 18, 2018
@vincentpierre vincentpierre self-assigned this Dec 18, 2018
@vincentpierre
Contributor

I tried to reproduce your error on OSX, but the environment closes as expected on v0.6. What version are you using? If you are using v0.6, this could be a Linux-specific error.

@taylerallen6
Author

It's the latest version. I just downloaded and installed it yesterday. Any advice?

@xiaomaogy
Contributor

Hi @taylerallen6, in your line env = UnityEnvironment(file_name=env_name, worker_id=0, seed=1), you need to specify another worker_id if the previous UnityEnvironment didn't quit properly. Your error message OSError: [Errno 98] Address already in use seems to indicate that your previous Unity environment didn't close properly.
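(If you just need the run to proceed while a stale socket drains, one option is to pick the first worker id whose port is actually free. This is a sketch, not ml-agents API: `find_free_worker_id` is a hypothetical helper that probes ports the same way `check_port` in rpc_communicator.py does, and the default base port of 5005 is an assumption — check your version.)

```python
import socket

def find_free_worker_id(base_port=5005, max_tries=10):
    """Return the first worker id whose port accepts a bind.

    base_port=5005 is assumed here; match it to your ML-Agents version.
    """
    for worker_id in range(max_tries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("localhost", base_port + worker_id))
            return worker_id          # this port is free
        except OSError:
            continue                  # port busy (possibly TIME_WAIT); try the next
        finally:
            s.close()
    raise RuntimeError("no free worker id found")
```

You would then pass the result as the worker_id argument when constructing UnityEnvironment.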

@taylerallen6
Author

I understand that, but opening multiple workers isn't a solution; then I'd just have multiple workers that don't get closed out.

@xiaomaogy xiaomaogy added help-wanted Issue contains request for help or information. and removed needs-info Issue contains insufficient information to be resolved. labels Dec 18, 2018
@xiaomaogy
Contributor

xiaomaogy commented Dec 19, 2018

OK, I was able to reproduce this bug.

This bug only happens on Linux; I tried the same steps on Mac and it doesn't occur.

Way to reproduce:

Running

env = UnityEnvironment(file_name=env_name, worker_id=0, seed=1)
env.close()

together twice in the same script triggers the bug. Running them separately won't cause this bug to happen.

I will log the bug for now.

@xiaomaogy xiaomaogy added bug Issue describes a potential bug in ml-agents. and removed help-wanted Issue contains request for help or information. labels Dec 19, 2018
@taylerallen6
Author

Ok thanks

@ivan-v-kush

@xiaomaogy
any news how to fix this bug?

@zheyangshi

@xiaomaogy Hi! I tried to run PPO2 through gym on Windows 10, but hit the same problem:
"mlagents.envs.exception.UnityWorkerInUseException: Couldn't start socket communication because worker number 0 is still in use. You may need to manually close a previously opened environment or use a different worker number."

@ervteng
Contributor

ervteng commented Feb 21, 2019

Hi all, this is actually normal behavior with sockets on most platforms. When a socket is closed, it enters a TIME_WAIT state: https://stackoverflow.com/questions/337115/setting-time-wait-tcp. By default, Ubuntu sets this time to 60 seconds, so a minute later the socket is released from TIME_WAIT and you'll be able to open the environment again.

We are looking for a workaround for ml-agents. There are ways to shorten TIME_WAIT in your system. Also, one workaround for Linux is to add s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) after the socket creation in check_port(self, port) in rpc_communicator.py. However, this causes undesirable behavior on other platforms, so we won't be using it in the code.
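(As a concrete sketch of that Linux-only workaround — a hypothetical standalone function mirroring the shape of check_port, not the actual ml-agents source:)

```python
import socket

def check_port(port):
    """Fail fast if the worker port is taken, but tolerate TIME_WAIT.

    SO_REUSEADDR lets the bind succeed even if a previous run's socket is
    still draining in TIME_WAIT. As noted above, enabling it unconditionally
    causes undesirable behavior on some platforms, so treat this as a
    local Linux-only patch.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(("localhost", port))
    except OSError:
        raise RuntimeError(f"worker port {port} is still in use")
    finally:
        s.close()
```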

@zheyangshi

@ervteng Thanks a lot!

However, I still have the same problem when running the code, even after maybe 20 hours. It reports: “Couldn't start socket communication because worker number 0 is still in use”.

@ervteng
Contributor

ervteng commented Feb 21, 2019

Hi @zheyangshi, what library are you using to run ML-Agents?

Also, I've edited my comment to add a workaround for Ubuntu specifically. You can give it a try.

@zheyangshi

zheyangshi commented Feb 21, 2019

@ervteng Thank you very much, and I will try it later.

I just directly ran the PPO2 code from https://github.com/Unity-Technologies/ml-agents/blob/master/gym-unity/README.md and received the error on Win10. What's more, I am a little confused because the DQN code on the same page runs successfully.

@ervteng
Contributor

ervteng commented Mar 5, 2019

Hey @zheyangshi , were you able to run the code?

It seems that the issue isn't with ML-Agents if DQN does work. Does PPO2 run on, e.g., CartPole or Atari?

@zheyangshi

zheyangshi commented Mar 6, 2019

Hi @ervteng, I tried it this morning and it still doesn't work as expected.
As for other environments, I'll try them later. Thanks a lot.

PS: ppo2 does work when run via "python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4".

@ervteng
Contributor

ervteng commented Mar 6, 2019

Hey @zheyangshi, I'm assuming you've rebooted your machine since last week. If not, that may help.

Also, how many parallel environments are you running with ppo2? Make sure the rank param is being incremented properly in your run code.

@zheyangshi

zheyangshi commented Mar 12, 2019

Hello @ervteng, I have actually rebooted the computer but still got the same result.

I also changed the number of environments from 1 to 4, and that didn't seem to work either. What's more, would you mind explaining what you mean by "Make sure the rank param is being incremented properly in your run code"?

Thanks a lot!

@ivan-v-kush

My workaround is to store the last worker id in a file; that solved the problem for me:

def get_worker_id(filename="worker_id.dat"):
    # Read the previous id (0 if the file is new), increment it, and
    # write it back so the next run gets a fresh worker id.
    with open(filename, 'a+') as f:
        f.seek(0)
        val = int(f.read() or 0) + 1
        f.seek(0)
        f.truncate()
        f.write(str(val))
        return val


self.env = ObstacleTowerEnv('/home/df/sources/obstacle-tower-challenge/ObstacleTower/obstacletower.x86_64',
        worker_id=get_worker_id(), retro=False)

@zheyangshi

zheyangshi commented Mar 12, 2019

@ivan-v-kush, thank you very much! I think it could be a nice solution for me.

@ervteng
Contributor

ervteng commented Mar 12, 2019

Hey @zheyangshi, the baselines PPO2 code uses make_env(rank) to create the environment, which we overrode in make_unity_env. Last I checked, each environment is passed a unique rank value, which we can use as our worker_id. But the baselines code changes all the time, so I'm not sure this still holds. You might be able to debug this by printing out the rank value in that function and making sure they're all unique.
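(The rank-based pattern being described can be sketched like this, with a hypothetical DummyUnityEnv standing in for the real gym-unity wrapper:)

```python
# Each env factory closes over a unique rank, which becomes its worker_id,
# so every parallel environment binds a distinct port.
class DummyUnityEnv:
    """Hypothetical stand-in for the gym-unity UnityEnv wrapper."""
    def __init__(self, worker_id):
        self.worker_id = worker_id

def make_env(rank):
    def _thunk():
        return DummyUnityEnv(worker_id=rank)
    return _thunk

# baselines-style vectorized creation: ranks 0..3 become worker ids 0..3
env_fns = [make_env(rank) for rank in range(4)]
envs = [fn() for fn in env_fns]
```

If two factories ever share a rank, two environments try to bind the same port and you get exactly the "worker number is still in use" error above.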

@ivan-v-kush, thanks for the workaround! That should work as well. Anything that ensures the worker_id is unique between environments.

@zheyangshi

@ervteng Thanks for your kind reply. Finally, I got your points.

@awjuliani
Contributor

Hi all. We have recently reworked the trainer code as of v0.8. Due to inactivity I am closing this issue. Please let us know if you still run into this issue in the latest version.

@joobei

joobei commented Sep 3, 2019

I am still experiencing this issue on Ubuntu 18.04, ml-agents 0.9.1. I have to wait some time before I re-run mlagents-learn.

@AsadJeewa

AsadJeewa commented Oct 29, 2019

I am still experiencing this issue as well on a Linux build. It works when I change the base port, but is there any way to manually force-close the previous env?

@thepycoder

+1

@yijiezh

yijiezh commented May 28, 2020

Got the same issue, is it resolved?

@AsadJeewa

I updated ml-agents and am using Windows now, and I have not run into this issue.

@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 30, 2021