Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronous mode - tick hangs #2809

Open
Vaan5 opened this issue May 4, 2020 · 29 comments
Open

Synchronous mode - tick hangs #2809

Vaan5 opened this issue May 4, 2020 · 29 comments
Assignees
Labels
bug critical issue with high priority
Milestone

Comments

@Vaan5
Copy link
Contributor

Vaan5 commented May 4, 2020

Intro

I am running a simple script in synchronous mode which does the following:

  • spawn an ego vehicle
  • in a loop do:
    • set velocity
    • initiate tick
    • wait to receive data
    • store it.

Here is the code (i've simplified the synchronous mode example):

import carla
import queue


snapshots = queue.Queue()


def push_snapshot(snapshot):
    snapshots.put(snapshot)


def get_position(frame_id, vehicle_id):
    snapshot = snapshots.get(timeout=2)
    assert(frame_id == snapshot.frame)
    transform = snapshot.find(vehicle_id).get_transform()
    return transform.location.x, transform.location.y, transform.location.z


def main():
    sim_time = 0.0
    sim_time_end = 20.0
    delta_time = 0.05
    results = []

    client = carla.Client("127.0.0.1", 2000)
    world = client.get_world()

    # set mode to synchronous
    frame_id = world.apply_settings(carla.WorldSettings(True, False, delta_time))
    print("Synchronous mode will be applied in frame: {}".format(frame_id))

    # spawn car
    spectator = world.get_spectator()
    blueprint_library = world.get_blueprint_library()
    map_ = world.get_map()
    vehicles = blueprint_library.filter("vehicle")
    bp = vehicles[0]
    vehicle = world.spawn_actor(bp, map_.get_spawn_points()[0])
    world.tick()    # just so that I see it in the simulator

    world.on_tick(push_snapshot)

    cmd = [carla.command.ApplyVelocity(vehicle.id, carla.Vector3D(y=2.0))]    # constant for the minimal example
    while sim_time <= sim_time_end:
        client.apply_batch_sync(cmd)
        print("before tick")
        frame_id = world.tick()   # on newer versions raises an Exception from which you can't rly recover
        print("after tick {}".format(frame_id))
        sim_time += 0.05
        results.append((sim_time, get_position(frame_id, vehicle.id)))

    world.apply_settings(carla.WorldSettings(False, False, 0))
    vehicle.destroy()


if __name__ == '__main__':
    main()

Problem

The problem happens when running this multiple times, sometimes the tick call will just hang. I originally encountered this with the C++ API (0.9.7). After that, I was able to reproduce it with the PythonAPI on 0.9.7, 0.9.8, 0.9.9 as well.

In case of 0.9.7, the script would just hang forever. With the newer versions, an exception would be thrown because of the recently added tick timeout (#2556). But in my opinion this just hides the root problem.

In my opinion this is really a big shortcoming of Carla and the synchronous mode, and it basically makes the synchronous mode not that usable. If you want to have reproducible results, it is not possible as the frame (corresponding to the "hanging" tick call) would be dropped. The exception being thrown in newer versions due to a timeout is also not useful as you can't really recover from it.

I've already found a few related issues:

So here are my questions:

  1. Do you agree with me that this is a problem (and can be reproduced)?
  2. Is anything already planned regarding the fix (as I can see in the linked issue that @nsubiron also suggests getting rid of the cause of the problem instead of applying workarounds?

Additional info

IMPORTANT: The described problem doesn't occur all the time. You might have to run the script multiple times to see it. I typically run it like:

FOR /L %v IN (1,1,50) DO python tick_hangs.py

Environment

OS: Win 10
Python: 3.7
Carla versions:

  • 0.9.7 (locally built from 1f7669e) - tried both the python and C++ API
  • 0.9.8 (official package) - tried only the python API
  • 0.9.9 (official package) - tried only the python API
@nsubiron
Copy link
Collaborator

nsubiron commented May 5, 2020

I totally agree, the synchronous mode should not drop images, that breaks significantly the functionality. I've been working on a fix cause this has been bothering me a lot lately, but so far I couldn't get anything that works well (doing it on my free time so couldn't put a lot of effort). The problem is that needs to work both for sync and async modes, and async mode needs to drop images, so it's not so easy to find a solution that works well for both.

@hh0rva1h
Copy link

@nsubiron Would be really great if you could escalate the priority of this issue among devs as this is a major problem for reinforcement learning research with Carla.

@germanros1987 germanros1987 added critical issue with high priority bug labels Jun 25, 2020
@germanros1987 germanros1987 added this to the 0.9.10 milestone Jun 25, 2020
@germanros1987
Copy link
Member

@doterop This issue has high priority, could you and @bernatx take a look at this?

@hh0rva1h
Copy link

hh0rva1h commented Jun 25, 2020

@germanros1987 Thanks for triaging this. We looked again into it last night and found a non documented way to handle it:
@doterop added a timeout for ticks in march: 5d238b2 so the following code works around the problem:

try:
    world.tick(0.1)
except:
    print("WARNING: tick not received")

It seems the tick is done in the simulator just that the client never gets the signal.

Maybe for the time being until this is fixed

  1. In the API it should be documented that world.tick() takes a timeout argument
  2. This issue / workaround should be documented in the sync mode documentation

Note that we also tested to recompile Carla with the workaround proposed in #2070 and it did not reliably workaround the issue.

@nsubiron
Copy link
Collaborator

I can comment on what I've found checking this not long ago.

The origin of the frame drops is here

if (_is_writing) {
log_debug("session", _session_id, ": connection too slow: message discarded");
return;
}

That boolean is set to false when the completion callback is called and the socket starts accepting messages again. In async mode this is fine cause we cannot keep an infinite queue anyway (though it would be nice if the number of drops could be reduced), but in sync mode this can cause a deadlock. So I tried adding a double-buffer for messages, basically a queue of 2 elements, that should have solved the issue in sync mode since the server would hang until the client received the message and you have a buffer of 1 frame that compensates (since render is async you could still overlap), but to my surprise the frame drops still happened. I think one possible cause is that the callback is not called immediately after sending the message and somehow can be postponed, this is handled by the ASIO context, so even though the client already detected and forwarded the message to the API the server session is still marked as "_is_writing" when reading the next message.

@doterop
Copy link
Contributor

doterop commented Jul 7, 2020

Hi @Vaan5,

I've been struggling the last few days trying to reproduce this error. Although all points out in the direction that @nsubiron mentioned before, it would very handy for me to have a scenario where I can reproduce the error and therefore create a proper solution for both sync and async mode.

If you don't have this case, no worries. :)

Thanks!

@Vaan5
Copy link
Contributor Author

Vaan5 commented Jul 9, 2020

Hi, sry for the late reply.

Sadly i don't have a different scenario. The script i wrote in the first post is the smallest I could think of.

I tried it again yesterday with:

Carla 0.9.9.4 (downloaded from the github link)
Python 3.7.7
Windows 10 (sry I don't have an ubuntu so I can't try it on one)

Here is exactly what i did:

  1. Start Carla Server
  2. Open console; add the carla egg to the PYTHONPATH variable
  3. Run python tick_hangs.py // I named the script tick_hangs.py (i just copied the code from my first post into it - without any modifications) 100 times in a loop (FOR /L %v IN (1,1,100) DO python tick_hangs.py)

Out of the 100 runs, it happened once (0 times on the second try, 2 times on the third). So it still happens. Here is an output file of the 100 runs: out.txt (just search for RuntimeError or Error).

Edit: It also occurrs if you those setter-like functions are used instead of commands (like in #2812) - might come in handy for async mode checks.

@hh0rva1h
Copy link

hh0rva1h commented Jul 9, 2020

@doterop Just run the following script:

import carla

# connect to Carla
client = carla.Client("localhost", 2000)
world = client.load_world("Town04")

# set fixed time step length
settings = world.get_settings()
settings.fixed_delta_seconds = 0.05
settings.synchronous_mode = True
settings.no_rendering_mode = True
world.apply_settings(settings)

# do ticks, will deadlock eventually (will timeout after 10 seconds)
while True:
    print("Issuing tick ...")
    world.tick()
    print("Tick successful")

For me it takes less than a minute to trigger the issue:

Issuing tick ...
Tick successful
Issuing tick ...
Tick successful
Issuing tick ...
Traceback (most recent call last):
  File "tick_hang.py", line 17, in <module>
    world.tick()
RuntimeError: time-out of 10000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:2000

How easy it is to trigger that issue might depend on the hardware: on my laptop with integrated graphics only this is very easy trigger, on very fast machines with dedicated gpu it can take long to trigger.

@stale
Copy link

stale bot commented Sep 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issue has not had recent activity label Sep 11, 2020
@hh0rva1h
Copy link

@bernatx Where did this end up in? I failed to find the commit in master / dev, but the pull request ist closed?

@stale stale bot removed the stale Issue has not had recent activity label Sep 12, 2020
@hh0rva1h
Copy link

hh0rva1h commented Oct 2, 2020

@germanros1987 @bernatx I tested release 0.9.10 for the issue and it's still happening. Did the linked commit not make it into the release?

@bernatx
Copy link
Contributor

bernatx commented Oct 2, 2020

@hh0rva1h I have been running the script several times, each time for 50 times running around 20 seconds, and never happened the problem. I have tried in Town03 and also in Town04.
I'm using version 0.9.10.1 where we add a fix for a camera freeze we detected, but I don't think it is related to this.
I will do more tests in Ubuntu also and let you know.
In the meantime, can you check also that 0.9.10.1 version?

@hh0rva1h
Copy link

hh0rva1h commented Oct 2, 2020

@bernatx Thanks for letting me know, I fetched 0.9.10.1 now, but it's still happening to me:
https://youtu.be/X3WZ-tmltWo
I guess 20 seconds are not long enough, it can take time until the problem occurs, also keep in mind that the problem is harder to trigger on beefy machines. I am reproducing this on an XPS 13 9360 from 2016, a dual core i5 machine with no dedicated graphics.

@bernatx
Copy link
Contributor

bernatx commented Oct 2, 2020

ok, I have been able to reproduce using a package a let the script a bit more of time to run. It happens at random, but now I can try to check further. I will tell you what I find, thank you.

@bernatx
Copy link
Contributor

bernatx commented Oct 5, 2020

Hi, I have opened a PR fixing this problem (#3394).
It seems it was a race condition and a wrong management of a handler, that was queued instead of executed in real time.

@hh0rva1h
Copy link

@bernatx Tried the nightly build, looks good so far. Thanks very much!

@nsubiron
Copy link
Collaborator

Left some comments on #3394, it fixes this issue, but I think introduces other issues.

@Vaan5
Copy link
Contributor Author

Vaan5 commented Dec 1, 2020

Probably isn't of much help (as I'm quite late): tried it with 0.9.10 and I couldn't reproduce it anymore.

@varunjammula
Copy link

This behavior is not deterministic. The tick hang occurs most of the times but the intended behavior magically works only some times

@stale
Copy link

stale bot commented Apr 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issue has not had recent activity label Apr 26, 2021
@egeonat
Copy link

egeonat commented May 7, 2021

I'm running CARLA version 0.9.11 on Ubuntu and I am still experiencing this issue in synchronous mode with 1/20 fixed_delta_seconds. Is there any progress? Were the issues specified in the comments by @nsubiron fixed?

@stale stale bot removed the stale Issue has not had recent activity label May 7, 2021
@varunjammula
Copy link

varunjammula commented May 7, 2021

@egeonat Did you try the synchronous_mode in the PythonAPI examples folder? They changed the way the synchronous mode works, I think. If you follow the example, it should work and this issue will not happen again. (https://github.com/carla-simulator/carla/blob/master/PythonAPI/examples/synchronous_mode.py)

@egeonat
Copy link

egeonat commented May 7, 2021

@varunjammula I am trying to use CARLA for RL training so my code has some differences, but I have mostly followed the synchronous_mode.py example. I believe my problem might be slightly different than the original poster, as when I call world.tick() the call doesn't timeout. It simply hangs forever. That is why I was wondering if perhaps new issues were introduced to the simulator with the PR, as suggested by nsublron.

@stale
Copy link

stale bot commented Jul 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issue has not had recent activity label Jul 21, 2021
@hh0rva1h
Copy link

@egeonat You can specify a timeout for world.tick, e.g. world.tick(1), then it wont simpy hang forever. You can catch that with a try block.

@stale stale bot removed the stale Issue has not had recent activity label Aug 19, 2021
@stale
Copy link

stale bot commented Jan 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issue has not had recent activity label Jan 9, 2022
@Julbak
Copy link

Julbak commented Mar 29, 2023

Jumped to the version 0.9.14 and managed to hang the server, task manager and closing the process works. Exactly the same code worked with 0.9.13. I'm using the quick start package installation as the other option was painful beyond anything.

@stale stale bot removed the stale Issue has not had recent activity label Mar 29, 2023
@stale
Copy link

stale bot commented Jun 1, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issue has not had recent activity label Jun 1, 2023
@SExpert12
Copy link

Intro

I am running a simple script in synchronous mode which does the following:

  • spawn an ego vehicle

  • in a loop do:

    • set velocity
    • initiate tick
    • wait to receive data
    • store it.

Here is the code (i've simplified the synchronous mode example):

import carla
import queue


snapshots = queue.Queue()


def push_snapshot(snapshot):
    snapshots.put(snapshot)


def get_position(frame_id, vehicle_id):
    snapshot = snapshots.get(timeout=2)
    assert(frame_id == snapshot.frame)
    transform = snapshot.find(vehicle_id).get_transform()
    return transform.location.x, transform.location.y, transform.location.z


def main():
    sim_time = 0.0
    sim_time_end = 20.0
    delta_time = 0.05
    results = []

    client = carla.Client("127.0.0.1", 2000)
    world = client.get_world()

    # set mode to synchronous
    frame_id = world.apply_settings(carla.WorldSettings(True, False, delta_time))
    print("Synchronous mode will be applied in frame: {}".format(frame_id))

    # spawn car
    spectator = world.get_spectator()
    blueprint_library = world.get_blueprint_library()
    map_ = world.get_map()
    vehicles = blueprint_library.filter("vehicle")
    bp = vehicles[0]
    vehicle = world.spawn_actor(bp, map_.get_spawn_points()[0])
    world.tick()    # just so that I see it in the simulator

    world.on_tick(push_snapshot)

    cmd = [carla.command.ApplyVelocity(vehicle.id, carla.Vector3D(y=2.0))]    # constant for the minimal example
    while sim_time <= sim_time_end:
        client.apply_batch_sync(cmd)
        print("before tick")
        frame_id = world.tick()   # on newer versions raises an Exception from which you can't rly recover
        print("after tick {}".format(frame_id))
        sim_time += 0.05
        results.append((sim_time, get_position(frame_id, vehicle.id)))

    world.apply_settings(carla.WorldSettings(False, False, 0))
    vehicle.destroy()


if __name__ == '__main__':
    main()

Problem

The problem happens when running this multiple times, sometimes the tick call will just hang. I originally encountered this with the C++ API (0.9.7). After that, I was able to reproduce it with the PythonAPI on 0.9.7, 0.9.8, 0.9.9 as well.

In case of 0.9.7, the script would just hang forever. With the newer versions, an exception would be thrown because of the recently added tick timeout (#2556). But in my opinion this just hides the root problem.

In my opinion this is really a big shortcoming of Carla and the synchronous mode, and it basically makes the synchronous mode not that usable. If you want to have reproducible results, it is not possible as the frame (corresponding to the "hanging" tick call) would be dropped. The exception being thrown in newer versions due to a timeout is also not useful as you can't really recover from it.

I've already found a few related issues:

So here are my questions:

  1. Do you agree with me that this is a problem (and can be reproduced)?
  2. Is anything already planned regarding the fix (as I can see in the linked issue that @nsubiron also suggests getting rid of the cause of the problem instead of applying workarounds?

Additional info

IMPORTANT: The described problem doesn't occur all the time. You might have to run the script multiple times to see it. I typically run it like:

FOR /L %v IN (1,1,50) DO python tick_hangs.py

Environment

OS: Win 10 Python: 3.7 Carla versions:

  • 0.9.7 (locally built from 1f7669e) - tried both the python and C++ API
  • 0.9.8 (official package) - tried only the python API
  • 0.9.9 (official package) - tried only the python API

Hi,

I want to retrieve data after every 10ms in my code. In this code I am not able to understand that why there is world_tick() and world_on_tick() after the vehicle spawn statement is written?
Can any one help me to understand?

How to write code if I need to update vehicle location after every 10 ms, so I can change the car crossing order?

Thanks in advance

@stale stale bot removed the stale Issue has not had recent activity label Dec 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug critical issue with high priority
Projects
None yet
Development

No branches or pull requests

10 participants