Synchronous mode - tick hangs #2809
Comments
I totally agree, synchronous mode should not drop images; that significantly breaks the functionality. I've been working on a fix because this has been bothering me a lot lately, but so far I couldn't get anything that works well (I'm doing it in my free time, so I couldn't put a lot of effort into it). The problem is that it needs to work for both sync and async modes, and async mode needs to drop images, so it's not easy to find a solution that works well for both.
@nsubiron It would be really great if you could escalate the priority of this issue among the devs, as this is a major problem for reinforcement learning research with Carla.
@germanros1987 Thanks for triaging this. We looked into it again last night and found an undocumented way to handle it:
It seems the tick is done in the simulator, but the client never gets the signal. Maybe for the time being until this is fixed
Note that we also tested recompiling Carla with the workaround proposed in #2070, and it did not reliably work around the issue.
I can comment on what I found when checking this not long ago. The origin of the frame drops is here: carla/LibCarla/source/carla/streaming/detail/tcp/ServerSession.cpp, lines 80 to 83 in baf43b0.
That boolean is set to false when the completion callback is called and the socket starts accepting messages again. In async mode this is fine because we cannot keep an infinite queue anyway (though it would be nice if the number of drops could be reduced), but in sync mode this can cause a deadlock. So I tried adding a double buffer for messages, basically a queue of 2 elements. That should have solved the issue in sync mode, since the server would hang until the client received the message and you have a buffer of 1 frame that compensates (since render is async you could still overlap), but to my surprise the frame drops still happened. I think one possible cause is that the callback is not called immediately after sending the message and can somehow be postponed; this is handled by the ASIO context. So even though the client has already detected the message and forwarded it to the API, the server session is still marked as "_is_writing" when reading the next message.
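The drop-while-writing behavior described above can be illustrated with a toy model. This is only a sketch, not CARLA's ServerSession code: the class ToySession, the callback_delay parameter and the stream_frames helper are all made up for illustration, and the only assumption taken from the comment is that a message arriving while the session is still marked as writing gets dropped.

```python
class ToySession:
    """Toy model of a streaming session; not CARLA's actual ServerSession."""

    def __init__(self, callback_delay):
        # Hypothetical knob: how many later write attempts happen before the
        # completion callback of an in-flight write gets to run.
        self.callback_delay = callback_delay
        self.is_writing = False
        self._attempts_left = 0
        self.sent = []
        self.dropped = []

    def write(self, frame):
        if self.is_writing:
            # A write is still marked as in flight, so this message is dropped.
            self.dropped.append(frame)
            self._attempts_left -= 1
            if self._attempts_left <= 0:
                self.is_writing = False  # the postponed callback finally runs
            return
        self.sent.append(frame)
        if self.callback_delay > 0:
            self.is_writing = True
            self._attempts_left = self.callback_delay


def stream_frames(num_frames, callback_delay):
    """Push frames 1..num_frames through a ToySession; return (sent, dropped)."""
    session = ToySession(callback_delay)
    for frame in range(1, num_frames + 1):
        session.write(frame)
    return session.sent, session.dropped
```

With callback_delay=0 the callback always runs before the next message and nothing is dropped. With callback_delay=1 every second frame is dropped, and a synchronous client waiting for one of those frames would never receive it, which matches the hang described in this issue.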
Hi @Vaan5, I've been struggling for the last few days trying to reproduce this error. Although everything points in the direction that @nsubiron mentioned before, it would be very handy for me to have a scenario where I can reproduce the error and therefore create a proper solution for both sync and async mode. If you don't have such a case, no worries. :) Thanks!
Hi, sorry for the late reply. Sadly I don't have a different scenario. The script I wrote in the first post is the smallest I could think of. I tried it again yesterday with:
Here is exactly what i did:
Out of the 100 runs, it happened once (0 times on the second try, 2 times on the third). So it still happens. Here is an output file of the 100 runs: out.txt (just search for RuntimeError or Error). Edit: It also occurs if those setter-like functions are used instead of commands (like in #2812), which might come in handy for async mode checks.
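Since the hang only shows up in a fraction of runs, a small harness that repeats a script and counts the failing runs can make the rate easier to compare across versions. The following is a minimal sketch, not the procedure used above: the function name count_failed_runs, the "RuntimeError" marker default and the per-run timeout are assumptions chosen for illustration.

```python
import subprocess


def count_failed_runs(command, runs, marker="RuntimeError", timeout=60.0):
    """Run `command` `runs` times and return the indices of the failed runs.

    A run counts as failed if it exits with a non-zero code, prints `marker`
    to stdout or stderr, or does not finish within `timeout` seconds.
    """
    failed = []
    for index in range(runs):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            failed.append(index)  # the script itself hung
            continue
        output = result.stdout + result.stderr
        if result.returncode != 0 or marker in output:
            failed.append(index)
    return failed
```

For example, count_failed_runs([sys.executable, "repro.py"], runs=100) would return the indices of the runs that hit the tick time-out, where repro.py is a placeholder name for the reproduction script.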
@doterop Just run the following script:
For me it takes less than a minute to trigger the issue:
How easy it is to trigger the issue might depend on the hardware: on my laptop with integrated graphics only, it is very easy to trigger; on very fast machines with a dedicated GPU it can take a long time to trigger.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@bernatx Where did this end up? I failed to find the commit in master / dev, but the pull request is closed?
@germanros1987 @bernatx I tested release 0.9.10 for the issue and it's still happening. Did the linked commit not make it into the release?
@hh0rva1h I have run the script several times, each time with 50 runs of around 20 seconds, and the problem never happened. I tried in Town03 and also in Town04.
@bernatx Thanks for letting me know. I fetched 0.9.10.1 now, but it's still happening to me:
OK, I have been able to reproduce it using a package and letting the script run a bit longer. It happens at random, but now I can check further. I will tell you what I find, thank you.
Hi, I have opened a PR fixing this problem (#3394).
@bernatx Tried the nightly build, looks good so far. Thanks very much!
Left some comments on #3394: it fixes this issue, but I think it introduces other issues.
Probably not of much help (as I'm quite late): I tried it with 0.9.10 and couldn't reproduce it anymore.
This behavior is not deterministic. The tick hang occurs most of the time, and the intended behavior works only sometimes.
I'm running CARLA version 0.9.11 on Ubuntu and I am still experiencing this issue in synchronous mode with fixed_delta_seconds set to 1/20. Is there any progress? Were the issues described in @nsubiron's comments fixed?
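For reference, the setup mentioned here (synchronous mode with a fixed step of 1/20 s) can be sketched as a context manager that restores the previous settings on exit. world.get_settings(), world.apply_settings() and the synchronous_mode and fixed_delta_seconds attributes are CARLA Python API names; the helper itself is hypothetical, and it assumes that each get_settings() call returns an independent settings object.

```python
from contextlib import contextmanager


@contextmanager
def synchronous_mode(world, fixed_delta_seconds=1.0 / 20.0):
    """Enable synchronous mode on `world`, restoring the old settings on exit."""
    # Assumption: get_settings() returns a fresh object on every call, so
    # `original` is not affected by the changes made to `settings` below.
    original = world.get_settings()
    settings = world.get_settings()
    settings.synchronous_mode = True
    settings.fixed_delta_seconds = fixed_delta_seconds
    world.apply_settings(settings)
    try:
        yield world
    finally:
        # Restore the previous mode even if the body raised, for example on a
        # tick time-out, so the server is not left waiting for ticks.
        world.apply_settings(original)
```

Restoring the settings in the finally branch matters for this issue in particular: if a tick times out and the client exits while the server is still in synchronous mode, the server keeps waiting for a tick that never comes.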
@egeonat Did you try synchronous_mode.py in the PythonAPI examples folder? I think they changed the way synchronous mode works. If you follow the example, it should work and this issue should not happen again. (https://github.com/carla-simulator/carla/blob/master/PythonAPI/examples/synchronous_mode.py)
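The core idea of that example, as far as it can be summarized without CARLA, is to push every snapshot and sensor reading into a queue and, after each tick, read from the queue until the item belonging to the current frame arrives. The sketch below shows only that matching step; FrameData is a stand-in for any object with a frame attribute, and retrieve_frame is a hypothetical helper, not a CARLA function.

```python
import queue
from dataclasses import dataclass


@dataclass
class FrameData:
    """Stand-in for a snapshot or sensor reading that carries a frame number."""
    frame: int
    payload: str = ""


def retrieve_frame(data_queue, frame, timeout):
    """Return the queued item for `frame`, discarding items from other frames.

    Raises queue.Empty if nothing arrives within `timeout` seconds, which is
    what a dropped frame looks like from the client side.
    """
    while True:
        data = data_queue.get(timeout=timeout)
        if data.frame == frame:
            return data
        # Items from earlier frames are stale leftovers and are skipped.
```

In a real client the queue would be filled by callbacks registered with world.on_tick and sensor.listen, and frame would be the value returned by world.tick(). A queue.Empty here means the data for that frame never reached the client.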
@varunjammula I am trying to use CARLA for RL training, so my code has some differences, but I have mostly followed the synchronous_mode.py example. I believe my problem might be slightly different from the original poster's, as when I call
@egeonat You can specify a timeout for
Jumped to version 0.9.14 and managed to hang the server; closing the process from the task manager works. Exactly the same code worked with 0.9.13. I'm using the quick start package installation, as the other option was painful beyond anything.
Hi, I want to retrieve data every 10 ms in my code. In this code, I don't understand why world_tick() and world_on_tick() are called after the vehicle spawn statement. How should I write the code if I need to update the vehicle location every 10 ms, so that I can change the car crossing order? Thanks in advance.
Intro
I am running a simple script in synchronous mode which does the following:
Here is the code (I've simplified the synchronous mode example):
Problem
The problem happens when running this multiple times: sometimes the tick call will just hang. I originally encountered this with the C++ API (0.9.7). After that, I was able to reproduce it with the PythonAPI on 0.9.7, 0.9.8 and 0.9.9 as well.
In the case of 0.9.7, the script would just hang forever. With the newer versions, an exception is thrown because of the recently added tick timeout (#2556). But in my opinion this just hides the root problem.
This is really a big shortcoming of Carla and its synchronous mode, and it basically makes the synchronous mode not that usable. If you want reproducible results, that is not possible, as the frame corresponding to the "hanging" tick call is dropped. The exception thrown in newer versions due to the timeout is also not useful, as you can't really recover from it.
I've already found a few related issues:
So here are my questions:
Additional info
IMPORTANT: The described problem doesn't occur all the time. You might have to run the script multiple times to see it. I typically run it like:
Environment
OS: Win 10
Python: 3.7
Carla versions: