The training process terminated unexpectedly #79
@amit-gshe thanks for opening this issue. This is a problem with the engine/environment, not with your GPU: the engine implements some checks to make sure it does not get stuck. It would be very useful if you could share additional information so that we can try to replicate this. In particular:
Looking forward to your reply, which will help narrow down the search for the bug!
@amit-gshe here is a first update: I have been able to spot a problem in KOF. I am not 100% sure this is what is causing your problem, but it is very probable. The good news is that I am able to replicate it at will, which means the bugfix should not be hard to find. In addition, the fix will be fully backward compatible, so you will be able to keep everything you already have. The other good news is that it allowed me to implement a robust verification procedure for all environments, which will prevent similar issues from happening in the future. I will keep you posted.
Hello @alexpalms, I tried it a few more times and, strangely, the problem disappeared. I checked the TensorBoard log and found that the FPS dropped significantly: it was previously about 200/s (as in the issue description above), but now, with the same number of concurrent containers started, it is only about 80/s. The FPS value is the same whether I train on GPU or CPU, but with the GPU the CPU usage is significantly reduced, since the training load moves to the GPU. For now, even beyond 1M timesteps, the training no longer terminates unexpectedly. Regarding your questions above:
@amit-gshe Perfect, thanks a lot for the feedback. Stage 4 is exactly where the problem I mentioned before occurs, and I have already fixed it. I am about to push a new engine, just completing the final checks. I will also post an explanation of the FPS difference (which is normal and expected) in the next message, to confirm the closure of the issue. The fact that the failure has not happened again is normal, since it has some randomness, but my fix will solve it. Thanks a lot for your feedback; I will post an update here.
@amit-gshe I just pushed the new engine that integrates the fix needed for KOF in stage 4. I am confident this will prevent this failure from happening in the future: the engine was not properly handling the fact that there is a single opponent in that stage.

Regarding the frame rate difference you noticed: when the game transitions between rounds or stages, it runs faster than during the combat phase. The bug that caused your error was keeping the engine stuck in a transitioning condition, thus letting it iterate faster. The correct frame rate is the one you saw under normal functioning, which is slower than the transitioning speed. You will receive the new image automatically at the next execution.

KOF is the most recently added game, so it has been tested less than the others. For this reason, all the issues you opened have been very useful to improve its robustness, thanks a lot! While fixing this bug, I also managed to implement an additional automated test that will now be run on all new games, which greatly improves robustness. I am closing this issue, but do not hesitate to let us know in case you encounter other problems.
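To make the idea of such an automated check concrete, here is a minimal, purely hypothetical sketch (not the actual DIAMBRA test code) of a "no stuck states" verification for a Gym-style environment: it runs several episodes against a stdlib-only toy environment and asserts that every episode terminates within a step budget, which is the kind of property that would have caught an engine stuck in a stage-transition loop. The `ToyFightEnv` class and all names below are illustrative assumptions.

```python
# Hypothetical sketch of a "no stuck states" check for a Gym-style
# environment. NOT the actual DIAMBRA test code; the env below is a
# stdlib-only stand-in exposing the usual reset()/step() interface.
import random


class ToyFightEnv:
    """Toy stand-in for a fighting-game environment."""

    def __init__(self, max_round_len=200, seed=0):
        self.rng = random.Random(seed)
        self.max_round_len = max_round_len
        self.t = 0

    def reset(self):
        self.t = 0
        return {"stage": 1, "frame": 0}

    def step(self, action):
        self.t += 1
        # Episode ends when one side is knocked out (random here),
        # or at the round time limit.
        done = self.rng.random() < 0.02 or self.t >= self.max_round_len
        obs = {"stage": 1, "frame": self.t}
        reward = 0.0
        return obs, reward, done, {}


def assert_episodes_terminate(env, episodes=10, step_budget=10_000):
    """Fail if any episode exceeds the step budget (i.e. looks stuck)."""
    for ep in range(episodes):
        env.reset()
        for _ in range(step_budget):
            _, _, done, _ = env.step(action=0)
            if done:
                break
        else:
            raise AssertionError(
                f"episode {ep} did not terminate within {step_budget} steps"
            )


if __name__ == "__main__":
    assert_episodes_terminate(ToyFightEnv())
    print("all episodes terminated within budget")
```

A real version of this check would of course drive the actual engine with random actions across every stage; the point is only that a bounded-termination assertion is cheap to automate and catches "stuck in transition" bugs early.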
Hello, when I used my NVIDIA P106-100 for training, the training exited abnormally after reaching nearly 1M steps. I am not sure whether the problem is my graphics card or something else. The relevant log follows.
Command:
diambra run -s 6 -d -n python3 kof.py
Diambra Engine Image:
DIGEST:sha256:3ef428b6c827b3b36120019ca6507c4381487f8fc2581b07c6dc910e41c2846a
My Host Setup: