diambra run -s will hang in parallel environments. #64
Comments
@amit-gshe can you please add a description of your environment:
The docker version I'm using:
Here is the full log of the 3 envs:
@amit-gshe Can you check the logs of the docker containers?
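For context, a minimal sketch of pulling those logs with the standard Docker CLI (container IDs are placeholders):

```bash
docker ps                      # list the running environment containers
docker logs <container-id>     # dump the log of one engine container
docker logs -f <container-id>  # or follow it live while reproducing the hang
```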
I suspect one of the mame processes crashed while the engine didn't (probably another case where we need to improve error handling in the engine).
I just tested it and it seems that if the execution of mame fails at startup, the engine crashes too (it throws a runtime_error exception). It is true, though, that if the emulator crashes after startup (after it has been initialized), the engine is not able to detect the failure.
@discordianfish @amit-gshe do you know a way to replicate a slow internet connection for reproducing this problem? I am not completely sure this is the cause though: if the authentication step fails (for whatever reason, including a timeout in the API request), the docker container should fail and return an error message, shouldn't it @discordianfish?
If I understand @amit-gshe correctly, the theory is that after 'stored credentials found' the engine will try to validate them but never time out. @alexpalms Should it time out? Can we make it retry (a few times, ideally with exponential backoff)? You can probably reproduce something similar by adding the API domain to your /etc/hosts with some unreachable IP.
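A sketch of that reproduction trick (the hostname here is a placeholder, since the actual API domain the engine contacts is not shown in this thread; 203.0.113.1 is a reserved, non-routable test address):

```bash
# Point the API hostname at a blackhole address so the credential-validation
# request made after "stored credentials found" can never complete.
echo "203.0.113.1  api.diambra.example" | sudo tee -a /etc/hosts

# Remove the entry again once done testing.
sudo sed -i '/api\.diambra\.example/d' /etc/hosts
```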
@discordianfish I understood the same thing, and I think it should time out, but I'm not sure. Thanks for the suggestion on how to replicate it, I will try it locally. In the meantime @amit-gshe, I just pushed a new engine that better handles emulator crashes, can you retry with it? It should be automatically pulled by docker (tag: v2.1.0-rc14) when you run scripts using our command line interface (e.g. diambra run python script.py).
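For reference, the parallel launch that hits the issue would look roughly like this (a sketch only; the -s value is assumed to match the 8 environments mentioned below, and script.py is a placeholder for the training script):

```bash
# Start the training script with 8 parallel DIAMBRA environments; the CLI
# pulls the engine image (here tag v2.1.0-rc14) automatically.
diambra run -s 8 python script.py
```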
@alexpalms I tried the latest engine and the validation issue has gone away. Now if I start 8 envs, all 8 mame.real processes are created successfully, but some mame processes hang after the following log:
After a very long time, those abnormal engine containers fail with these logs:
@amit-gshe thank you for the feedback, it is interesting (and strange). As a final test step, could you please retry now? We just released a new engine version that, together with the previous change, better handles the license auth timeout on slow internet connections. It should be automatically pulled by docker (tag: v2.1.0-rc15) when you run scripts using our command line interface (e.g. diambra run python script.py).
@alexpalms
@amit-gshe thanks a lot for the feedback. Good that you could confirm the authentication is now more robust. Regarding the CPU problem that is preventing training from starting, I would like to ask you for an additional test: could you try a few other games, in particular doapp, sfiii3n, and tektagt, and see if the problem is still there for them as well?
@alexpalms I tried the other games listed above (doapp, sfiii3n, tektagt); all have the same problem.
@amit-gshe ok, thanks for this additional test. We will review the whole thread and your inputs to gain some insight into what might be happening and how to reproduce it. We will keep you posted. If you come up with additional clues or elements, do not hesitate to post more comments here!
Hey @amit-gshe, I worked a bit on the engine to improve startup speed and robustness. I pushed a new custom engine to my personal Docker Hub so that you can test it. The engine docker image is called:
Note that you have to remove the image you currently have so that the new one is used. It would be great if you could use it to test your systems with a few of the games. Looking forward to hearing your feedback!
@alexpalms Thank you for your work. I just tried the image you provided, and the problem of some containers' CPU sitting at 100% is still not resolved. The container log is the same as the one provided above.
@amit-gshe can you post the full log containing the final error of the containers? I would like to see the error.
@amit-gshe I just pushed a new version of the same engine docker image.
@alexpalms Sorry for taking so long to reply. I tried your latest image and now the hung containers never seem to exit with an error. Below is the output of docker stats and all the script logs:
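For reference, a minimal sketch of how such a per-container snapshot can be collected with the standard Docker CLI:

```bash
# One-shot snapshot of CPU/memory per container; the hanging engine
# containers were reported pinned near 100% CPU.
docker stats --no-stream
```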
@amit-gshe Thanks a lot for your detailed feedback. It is really hard to say what is going wrong. I tested your training script using 10 and 20 envs on my local machine, which has 6 CPUs/12 threads and 64 GB of RAM, and both worked fine. Not being able to replicate the problem makes it hard to spot the issue here. I would like to ask you to do the following two things for the hanging containers with the high CPU load:
The reason being, I would like to understand which process is causing the CPU to spin, and also make sure the named FIFO pipes are being created, as I suspect this is what is blocking it.
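A sketch of the kind of checks meant here (the container name is a placeholder, and the pipes' location is not known from this thread, so the search is filesystem-wide):

```bash
# Open a shell in one of the hanging engine containers (ID/name is a placeholder).
docker exec -it <engine-container-id> sh

# Inside the container: see which process is spinning the CPU.
top -b -n 1 | head -n 20

# Look for named FIFO pipes anywhere on the filesystem; "-type p" matches pipes.
find / -type p 2>/dev/null
```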
@alexpalms I entered the abnormal container and followed your instructions; the following is the result of the command:
@amit-gshe thanks for the feedback. In addition, I just pushed a new image that has some more debug log info; it is named:
@alexpalms It's hard to install htop in the container because the bin dir does not contain the apt-get command:
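As a possible workaround (a sketch only; it assumes a POSIX shell and awk are available inside the image), per-process CPU time can be read directly from /proc without installing anything:

```bash
# Print pid, command name, and accumulated CPU ticks (utime + stime,
# fields 14 and 15 of /proc/<pid>/stat) for every process in the container.
for stat in /proc/[0-9]*/stat; do
  awk '{print $1, $2, "cpu_ticks=" $14 + $15}' "$stat" 2>/dev/null
done
```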
@alexpalms Hello, when I was about to install linux-perf to analyze the 100% CPU usage problem, I found that the kernel I was using did not have a corresponding version of the linux-perf package. So I tried another version of the kernel, and the problem disappeared. The kernel version I used before was linux-headers-5.18.17-amd64-desktop-community-hwe; when I switched to linux-image-5.15.77-amd64-desktop, I could start training smoothly. This is a bit strange. Once I found that the problem was with the kernel, I tried another kernel, linux-headers-5.18.17-amd64-desktop-hwe, and training started normally. So I started to compare the differences between the two kernels, linux-headers-5.18.17-amd64-desktop-hwe and linux-headers-5.18.17-amd64-desktop-community-hwe. The following is the .config diff between the two. There are several configurations related to CGROUP here, but I don't understand these options very well, and it is not clear whether they prevent the creation of the env container. Anyway, after I switched kernels, I was able to start training without any problems. Of course, I am also happy to provide further information on the above issues.
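For reference, a sketch of how such a comparison can be made (the config paths are assumptions; on Debian-style systems the kernel configs usually live under /boot):

```bash
# Unified diff of the two kernel configs, filtered to the CGROUP-related
# options that stood out in the comparison above.
diff -u /boot/config-5.18.17-amd64-desktop-hwe \
        /boot/config-5.18.17-amd64-desktop-community-hwe | grep -i CGROUP
```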
@amit-gshe this is so interesting! And you did one hell of a debugging job! Thanks a lot. I am not familiar with these aspects either; maybe @discordianfish can spot something about this subtle behavior you found?
@amit-gshe I did the additional test I wanted to do and everything looks fine, so this new engine seems ready to be merged. It will be released in the coming days after a few more tests. Thanks a lot for your support. I will keep this issue open in case we come to better understand what happened, but I am happy you managed to solve the issue and can run the environments smoothly!
Yeah, really hard to say... but yeah, probably some kernel bug? For the cgroup stuff, things should either work or not, not cause this behavior. But good that it's fixed! Let's close this issue; it will still be around and we can re-open it if it happens to other people as well.
Log:
I inspected the top command output and it seems that 2 container processes may be deadlocked and so can't respond to the client.