
Slow Rendering in pybullet #1545

Closed · BlGene opened this issue Feb 2, 2018 · 16 comments

@BlGene (Contributor) commented Feb 2, 2018

For RL I would like to have pybullet render as fast as possible. How would I best achieve this?
I tested the three connection modes with ER_BULLET_HARDWARE_OPENGL and got the following fps:

DIRECT +shadow: 29 fps (defaults to ER_TINY_RENDERER?)
DIRECT -shadow: 47 fps (defaults to ER_TINY_RENDERER?)
SHARED_MEMORY +shadow: 29 fps (using App_PhysicsServer_SharedMemory)
SHARED_MEMORY -shadow: 47 fps (using App_PhysicsServer_SharedMemory)
(to measure FPS I used this script: https://gist.github.com/BlGene/9010fb3671dac3cc566ee659fcda155a; a minimal sketch of the timing loop is shown below)
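
For context, a minimal sketch of this kind of benchmark (the exact script is in the gist above). It assumes pybullet_data provides plane.urdf and uses getCameraImage with its default camera matrices:

```python
import time
import pybullet
import pybullet_data

pybullet.connect(pybullet.DIRECT)
pybullet.setAdditionalSearchPath(pybullet_data.getDataPath())
pybullet.loadURDF("plane.urdf")

width, height, runs = 320, 220, 100
start = time.time()
for _ in range(runs):
    # In DIRECT mode there is no OpenGL context, so this appears to fall
    # back to the CPU TinyRenderer even when hardware OpenGL is requested.
    pybullet.getCameraImage(width, height,
                            renderer=pybullet.ER_BULLET_HARDWARE_OPENGL)
print("mean fps: {:.1f}".format(runs / (time.time() - start)))
pybullet.disconnect()
```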

The GUI or SHARED_MEMORY + App_ExampleBrowser connections run very slowly for me (<10 fps). This seems to work better for others; a YouTube video (https://www.youtube.com/watch?v=__ilQzkPDNI) mentions the ExampleBrowser running at 1000 fps.

The difference may be that pybullet is selecting the wrong GPU on my machine (NVS 310 instead of GeForce GTX 1070). Is there a way to control which GPU pybullet uses?

Training for RL with the GUI connection would require an option not to open the ExampleBrowser window. I expected COV_ENABLE_GUI to do this. Would that be sufficient to allow running in headless mode?

The following line in the documentation is confusing to me:
"Note that DIRECT mode has no OpenGL, so it requires ER_TINY_RENDERER."
From this I understood that App_PhysicsServer_SharedMemory uses OpenGL, although it's never explicitly stated. However, this does not appear to be the case, judging by the FPS performance and GPU memory consumption.

To get fast OpenGL-based rendering, another option would be to use the recently introduced dynamically loaded render plugins. One approach would be to adapt the roboschool rendering as a plugin. But before I attempt this, I would like to ask whether this is a reasonable approach and whether this code has already been written by somebody else.

Thanks in advance for any help.

@BlGene (Contributor, Author) commented Feb 2, 2018

Also:

1. In the pybullet documentation Google Doc, the setJointMotorControlArray input variable linkIndices should be jointIndices.
2. Should a way to access btCollisionShape::setMargin from pybullet be added?

@erwincoumans (Member) commented Feb 2, 2018

You cannot easily train in GUI or SHARED_MEMORY mode unless you only have one thread. You could just use DIRECT mode and TinyRenderer. Apparently, rendering the ground plane takes most of the time (try removing the loadURDF("plane.urdf")). TinyRenderer hasn't been optimized. A quick way to check the plane cost is sketched below.
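
A minimal sketch of that check, using the same timing-loop idea as the gist above (the fps numbers are of course machine-dependent):

```python
import time
import pybullet

pybullet.connect(pybullet.DIRECT)
# Deliberately no loadURDF("plane.urdf") here: per the comment above, the
# ground plane dominates TinyRenderer time in this scene.
start = time.time()
for _ in range(100):
    pybullet.getCameraImage(320, 220, renderer=pybullet.ER_TINY_RENDERER)
print("mean fps without plane: {:.1f}".format(100 / (time.time() - start)))
pybullet.disconnect()
```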

https://www.youtube.com/watch?v=__ilQzkPDNI

Yes, this example runs at over 1000 FPS, but there is no getCameraImage used: most of the locomotion RL tasks don't use vision/cameras.

roboschool rendering

That won't help; Bullet's own OpenGL renderer is likely faster than roboschool's. Time is currently going into other parts, see the screenshot below.

When in GUI / hardware OpenGL mode, the pixels are copied from the main window. Resizing it to match the requested camera image dimensions helps. Also disable the GUI (= buttons etc). See attached updated version.
rendertest.zip

Multi-GPU setups are a hassle indeed, can you just take out the GPU?
See also #1481

setJointMotorControlArray input variable linkIndices should be jointIndices.

Thanks, fixed

btCollisionShape::setMargin

No, setMargin is not exposed, it is not consistently implemented for various shapes, and would cause confusion. Why would you need/want it?

Later this year we will use vision/synthetic cameras and will likely optimize the 'getCameraImage' implementation. Note that very little time is currently spent in rendering (less than 200 microseconds on my 1080); it is mainly the copying of data / waiting that makes it several milliseconds.

When you run my updated script, there is additional performance logging. The renderTimings.json can be opened in Google Chrome using about://tracing.
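
For reference, a minimal sketch of capturing such profile timings from Python, assuming the startStateLogging API with STATE_LOGGING_PROFILE_TIMINGS (the attached script may do this differently):

```python
import pybullet

pybullet.connect(pybullet.GUI)
# Write tracing-compatible profile timings; pybullet typically appends a
# per-thread suffix and .json extension to the given file name.
log_id = pybullet.startStateLogging(
    pybullet.STATE_LOGGING_PROFILE_TIMINGS, "renderTimings")
for _ in range(100):
    pybullet.stepSimulation()
    pybullet.getCameraImage(320, 220,
                            renderer=pybullet.ER_BULLET_HARDWARE_OPENGL)
pybullet.stopStateLogging(log_id)
pybullet.disconnect()
```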

(screenshot: profile timings in the tracing viewer)

@erwincoumans (Member) commented Feb 2, 2018

The preview windows in the BulletExampleBrowser (GUI, SHARED_MEMORY, etc.) take some time too. You can make it faster (when using renderer=ER_BULLET_HARDWARE_OPENGL) by disabling them:

import pybullet

pixelWidth, pixelHeight = 320, 220
# Match the window size to the requested camera image dimensions.
optionstring = '--width={} --height={}'.format(pixelWidth, pixelHeight)

cid = pybullet.connect(pybullet.GUI, options=optionstring)
if cid < 0:
    raise ValueError("Could not connect to the physics server")
print("connected")

# Disable the side panels and the RGB/depth/segmentation preview windows.
pybullet.configureDebugVisualizer(pybullet.COV_ENABLE_GUI, 0)
pybullet.configureDebugVisualizer(pybullet.COV_ENABLE_SEGMENTATION_MARK_PREVIEW, 0)
pybullet.configureDebugVisualizer(pybullet.COV_ENABLE_DEPTH_BUFFER_PREVIEW, 0)
pybullet.configureDebugVisualizer(pybullet.COV_ENABLE_RGB_BUFFER_PREVIEW, 0)
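
Continuing the snippet above, the camera image is then rendered through the hardware OpenGL path (getCameraImage returns width, height, RGB, depth, and segmentation buffers):

```python
# Render via the ExampleBrowser's OpenGL renderer instead of TinyRenderer.
w, h, rgb, depth, seg = pybullet.getCameraImage(
    pixelWidth, pixelHeight, renderer=pybullet.ER_BULLET_HARDWARE_OPENGL)
```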

With this, I get around 280 FPS using a GTX 1080. It could be further optimized from 4 ms to 2 ms, but only with a lot of work.

(screenshot: profile timings with the preview windows disabled)

@BlGene (Contributor, Author) commented Feb 2, 2018 (comment minimized; its questions are quoted in the reply below)

@erwincoumans (Member) commented Feb 2, 2018

1. DIRECT connection mode does not use GPU rendering.

Indeed. DIRECT means each PyBullet call is processed directly in the same thread, using purely the CPU (= TinyRenderer at the moment).

2. The one-thread-per-bullet-instance constraint for GUI / DIRECT is satisfied if I run bullet instances in separate processes using multiprocessing.Process, as done by SubprocVecEnv.

Yes, if you use separate processes, you can use multiple ExampleBrowsers / GUI mode. A sketch of this pattern follows below.
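
A minimal sketch of that per-process pattern (each process makes its own connection and therefore owns its own physics server; SubprocVecEnv wraps the same idea):

```python
import multiprocessing as mp
import pybullet

def run_env(rank):
    # Each worker owns a private physics server through its own connection;
    # DIRECT here, but GUI also works one instance per process.
    cid = pybullet.connect(pybullet.DIRECT)
    for _ in range(100):
        pybullet.stepSimulation(physicsClientId=cid)
    pybullet.disconnect(physicsClientId=cid)

if __name__ == "__main__":
    procs = [mp.Process(target=run_env, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```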

3. What FPS are you getting for the BulletExampleBrowser with and without preview windows? (edit: this was answered, thanks)

Around 280 FPS, which is pretty good.

4. The fastest solution would probably be to write a TF GPU op which uses CUDA's OpenGL interoperability API to do an OpenGL->TF copy. This is a bit advanced for me, but cursory googling suggests this would require sharing OpenGL contexts between processes via GLX_EXT_import_context.

Yes, this would require a lot of effort.

@BlGene (Contributor, Author) commented Feb 2, 2018

So in the image in your last post, "Rendering" corresponds to getCameraImage and "Render2" to the BulletExampleBrowser.

4 ms? There's only about 0.4 ms of real rendering in there. How do I parallelize rendering/physics? 🚀 😁

@erwincoumans (Member) commented Feb 2, 2018

Things are already parallelized: physics runs in a separate thread from graphics.

Rendering + GPU readback is around 2.5 ms. I wouldn't waste time going from 4 ms to 2.5 ms, so take 280 FPS as your target. You cannot ignore the readback and copying of pixels.

@BlGene (Contributor, Author) commented Feb 2, 2018

OK, I'll try to add a way for GUI mode not to open an ExampleBrowser window. This would allow training.

@erwincoumans (Member) commented Feb 2, 2018

There is already some implementation using EGLOpenGLWindow. We used this internally at Google, in the cloud. It opens a context using EGL (not X11), and the window is not visible. What is your target platform?

@BlGene (Contributor, Author) commented Feb 2, 2018

Ubuntu with an Nvidia Titan X; I think EGLOpenGLWindow should work for me.

@BlGene (Contributor, Author) commented Feb 2, 2018

Is this implementation based on the GUI connection mode (the ExampleBrowser without the ExampleBrowser window) or on SHARED_MEMORY (App_PhysicsServer_SharedMemory)?

If it were possible to make this option available (on GitHub), that would be very helpful for me and others too.

Edit: Otherwise I'll try to take a stab at this by reducing OpenGLExampleBrowser.cpp.

@erwincoumans (Member) commented Feb 2, 2018

EGLOpenGLWindow is based on GUI mode. No need to 'reduce' anything.

@erwincoumans (Member) commented Feb 2, 2018

You need to define BT_USE_EGL in your build system and make it all compile again. I don't have time to dig into this right now.

@BlGene (Contributor, Author) commented Feb 2, 2018

👍 thanks a lot for all of the help!

I’ll try this next week.

@BlGene (Contributor, Author) commented Feb 16, 2018

I have tried to create a PR to make this stuff a bit easier, please see: #1565.
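
As a follow-up note: in later pybullet builds, headless GPU rendering became available through an EGL renderer plugin. A sketch of loading it, assuming the eglRenderer plugin ships with your pybullet build:

```python
import pkgutil
import pybullet

pybullet.connect(pybullet.DIRECT)
# Prefer the pip-installed plugin module if present, else load by name.
egl = pkgutil.get_loader('eglRenderer')
if egl:
    plugin = pybullet.loadPlugin(egl.get_filename(), "_eglRendererPlugin")
else:
    plugin = pybullet.loadPlugin("eglRendererPlugin")
# getCameraImage now renders on the GPU without an X11 window.
pybullet.getCameraImage(320, 220)
pybullet.unloadPlugin(plugin)
pybullet.disconnect()
```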

BlGene closed this Feb 28, 2018

@Rkartik27 commented Jun 12, 2019

Hi Erwin,

I am using an Nvidia Quadro M5000M GPU (8 GB) on Ubuntu 16.04 with Python 3.5.2. I am having trouble replicating such a high FPS (you said that you are getting 280 FPS) on my computer. I have even tried building pybullet from the latest source code using "python3 setup.py install --user", as I heard that a source-built pybullet might have GPU support.

(screenshot: nvidia-smi output while running the rendertest script)

pybullet also seems to be using the GPU on my laptop for the code you shared in rendertest.zip (link: https://github.com/bulletphysics/bullet3/files/1690473/rendertest.zip). The screenshot above shows my nvidia-smi output, in which the python3 process is the one running the pybullet code (please ignore the other Python 2 processes; other TensorFlow jobs are using the GPU). However, I am only getting an average of around 20 FPS, as seen in the print output below:

pybullet build time: Jun 12 2019 17:39:14
connecting
argv[0]=--width=640
argv[1]=--height=480
startThreads creating 1 threads.
starting thread 0
started thread 0
argc=4
argv[0] = --unused
argv[1] = --width=640
argv[2] = --height=480
argv[3] = --start_demo_name=Physics Server
ExampleBrowserThreadFunc started
X11 functions dynamically loaded using dlopen/dlsym OK!
X11 functions dynamically loaded using dlopen/dlsym OK!
Creating context
Created GL 3.3 context
Direct GLX rendering context obtained
Making context current
GL_VENDOR=NVIDIA Corporation
GL_RENDERER=Quadro M5000M/PCIe/SSE2
GL_VERSION=3.3.0 NVIDIA 396.54
GL_SHADING_LANGUAGE_VERSION=3.30 NVIDIA via Cg compiler
pthread_getconcurrency()=0
Version = 3.3.0 NVIDIA 396.54
Vendor = NVIDIA Corporation
Renderer = Quadro M5000M/PCIe/SSE2
b3Printf: Selected demo: Physics Server
startThreads creating 1 threads.
starting thread 0
started thread 0
MotionThreadFunc thread started
connected
ven = NVIDIA Corporation
Testing SHARED
mean: 19.56697039069384 for 100 runs

Testing SHARED w/o shadow
mean: 18.793571272315447 for 100 runs

Writing 17186 timings for thread 0
Writing 2202 timings for thread 1
Writing 47693 timings for thread 2
numActiveThreads = 0
stopping threads
Thread with taskId 0 exiting
destroy semaphore
semaphore destroyed
Thread TERMINATED
destroy main semaphore
main semaphore destroyed
finished
numActiveThreads = 0
btShutDownExampleBrowser stopping threads
Thread with taskId 0 exiting
Thread TERMINATED
destroy semaphore
semaphore destroyed
destroy main semaphore
main semaphore destroyed

Any help in my case would be highly appreciated, thanks.
