
Reuse GPU-created depth buffer and unproject it during a read #150

Merged
merged 2 commits into master from reusing-depth-buffer on Aug 16, 2019

Conversation

mosra
Collaborator

@mosra mosra commented Aug 13, 2019

Motivation and Context

This is a first step towards the goal of integrating Magnum's builtin shaders -- they don't have any such feature, so this was a blocker that had to be removed first.

Apart from that, this gives a nice little speedup in color-only rendering (0.2 ms) thanks to one less clear and one less render target being written to, while the additional operations for depth rendering aren't really slower in total.

(benchmark screenshot)

How Has This Been Tested

Pytest passes (it was a pain, but yes, it does!), and CI passed the 1400 FPS milestone 🎉

@mosra mosra requested review from msavva and bigbike August 13, 2019 21:27
@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Aug 13, 2019
Collaborator

@msavva msavva left a comment

Thanks @mosra
LGTM 👍 The detailed docs for the unprojection logic are much appreciated.
Given that tests are also passing, feel free to land.

@msbaines
Contributor

With this patch, the reverse transformation of depth values is done on the CPU instead of the GPU. Pushing pixel processing to the CPU will have a noticeable effect on CPU utilization. I suspect CPU utilization is much higher than before. This also precludes the GPU-GPU transfer of depth data to CUDA and forces a GPU-CPU and a CPU-GPU copy compared to the ideal case of no copies.

@@ -54,10 +54,8 @@ class GenericShader : public Magnum::GL::AbstractShaderProgram {
enum : uint8_t {
Contributor

Surprised. I thought such an enum definition would require C++17, while we are using C++14 in this project.

@mosra:
Are these outputs also required to be "mutually exclusive" for each shader?

For textures, I introduced a new header file called TextureBindingPointIndex.h in #132 to define enumerators (binding points) so that texture units would not conflict. I would like to move the "textureLayer" defined in GenericShader.cpp to this file. Any comments?

Contributor

I think this enum syntax was available in C++11:

https://en.cppreference.com/w/cpp/language/enum

Collaborator Author

Yes, C++11 added two new things to enums -- strong enums (enum class) and typed enums (enum: type). Those two are completely orthogonal features, one can be used without the other.

Texture binding points -- see my other comment at #151 (comment)

@erikwijmans
Contributor

This also precludes the GPU-GPU transfer of depth data to CUDA and forces a GPU-CPU and a CPU-GPU copy compared to the ideal case of no copies.

For GPU-GPU, the unprojection can be done on the GPU via CUDA quite easily.

Contributor

@msbaines msbaines left a comment

LGTM. As Erik commented, my initial concerns about GPU-GPU will be addressed by doing the depth unprojection in CUDA.

*/
const Matrix4 projection = camera.getMagnumCamera().projectionMatrix();
const Matrix4 invertedProjection =
Matrix4::translation(Vector3::zAxis(-projection.translation().z())) *
Contributor

I think this is where you are chopping off the near-plane by moving the near-plane to z=0. How does it work? Why would projection.translation().z() give you the correct z-translation?

Collaborator Author

Hmm. 🤔 Good point, actually 👍

I won't lie -- without this, the output differed from the depth value previously written explicitly by the shader by about 0.02. The projection translation Z is also about 0.02, so this operation seemed to make sense at the time. That was before the tests were passing: I hadn't yet discovered the discrepancy between mesh holes being represented as 0 in the previous case but as 1 by the GPU depth buffer, and thus wasn't aware that this shift doesn't have much effect on whether the test passes. It passes both with and without it; the threshold is large enough to allow that.

Now, going through the actual matrix math, it seems totally arbitrary and just a lucky coincidence 😅 Furthermore, when removing this arbitrary shift, the resulting depth frame is much closer to the actual ground truth (I could reduce the threshold from 1.5e-2 to 0.3e-2 and it still passes). Removing this, then -- thanks a lot for the heads-up.

A remaining mystery is the 0.02 difference between values produced by the GPU depth buffer and by the explicit shader calculation -- but since I'm closer to the ground truth with the GPU depth buffer, I would assume it's due to some rounding errors happening at various places (and also the switch from 24-bit depth to 32F depth could have an effect on that).

Instead of creating our own additional one. Main goal is to allow easier
integration of Magnum builtin shaders as they don't have any such
feature. Apart from that this gives a nice little speedup in color-only
rendering (0.2 ms) thanks to one clear less and one less render target
being written to; while the additional operations for depth rendering
aren't really slower in total.
@mosra
Collaborator Author

mosra commented Aug 15, 2019

@erikwijmans @msbaines updated to address your comments about the 2x2 and suspicious near plane shift, thanks a lot for calling me out on my rusty math :)

@msavva for efficient GPU/GPU depth transfer, I think it's better to have the depth unprojection shader on the GL side (and not in CUDA), as it can share code with the depth-only shader. Do you think it would make sense to sketch that out as part of this PR? Or better separately? #114 depends on that, so it needs to be done one way or the other :)

@msavva
Collaborator

msavva commented Aug 15, 2019

@msavva for efficient GPU/GPU depth transfer, I think it's better to have the depth unprojection shader on the GL side (and not in CUDA), as it can share code with the depth-only shader. Do you think it would make sense to sketch that out as part of this PR? Or better separately? #114 depends on that, so it needs to be done one way or the other :)

Agreed that it makes more sense to do this on the GL side. I'm strongly in favor of smaller PRs 🙂 -- so I'd say that it's better to land this PR and create a separate one for the depth-only shader.

@mosra
Collaborator Author

mosra commented Aug 15, 2019

Great, so unless somebody is against it, I'll merge this and #151 tomorrow morning (late evening today for you). Then I'll open further PRs for the aforementioned shader cleanup and depth-only rendering / depth unprojection.


| a b | | z | | az + b |
| c d | * | 1 | = | cz + d |

Contributor

I think it may be possible to solve the perspective projection for z.

Given:

projectionMatrix[2].z = A
projectionMatrix[3].z = B
projectionMatrix[2].w = -1
projectionMatrix[3].w = 0

zperspective = | A  B |   | z |   | Az + B | = (Az + B)/-z
               | -1 0 | * | 1 | = |  -z    |

zperspective = 2.0 * depth - 1.0

Plugging into above:

2.0 * depth - 1.0 = (Az + B)/-z

Solving for z:

z = -B/(2.0 * depth + A - 1.0)

Negating (to get positive values):

z = B/(2.0 * depth + A - 1.0)

If addition is used for 2.0 * depth, only 2 additions and 1 divide are required.

Collaborator Author

Thank you!

Will apply this in a separate PR together with setting up C++-side testing -- I wanted to measure the perf difference :)

Collaborator Author

Update after benchmarking -- this makes the calculation slightly faster (thanks again), however most of the time (>75%) is consumed by the if() that flips values at the far plane to 0.0f :/ In fact, even doing the dumbest full 4x4 multiplication and extracting just Z afterwards is only a minor perf penalty compared to the cost of the if.

It gets a bit faster when done in a separate loop, however that makes it much harder to do the unprojection in-place.

Cc: @erikwijmans

Contributor

@erikwijmans erikwijmans Aug 19, 2019

Can you try a ternary instead of an if, i.e. ptr[i] = z == 1.0f ? 0.0f : <unprojected>? Jumps are muuuccch slower than conditional moves and the compute cost of getting the unprojected value is so trivial that computing it more often than needed shouldn't matter :)

Collaborator Author

No change, as expected -- I assume the optimizer would try that as a first thing.

Problem is that the ifs prevent code from being batched & vectorized. If I move the condition to a separate loop it gets three times faster, but still about 2x slower than a dumb 4x4 multiplication without the branch would be.

Anyway, will submit this together with the GPU version tomorrow. Maybe you figure out an improvement again :)

Contributor

@erikwijmans erikwijmans Aug 19, 2019

Ah, yeah. The ternary won't help with the CPU version, but I believe it'll help with the OpenGL version (it 100% would with CUDA version, if's are terrible on CUDA).

@mosra mosra merged commit 323bd6b into master Aug 16, 2019
@mosra mosra deleted the reusing-depth-buffer branch August 19, 2019 13:28
eundersander pushed a commit to eundersander/habitat-sim that referenced this pull request Aug 6, 2020
* episode stats changed to dict of dicts

* handle non-unique episode_id

* use tuple as key

* update while loop checking

Co-Authored-By: Erik Wijmans <ewijmans2@gmail.com>
Ram81 pushed a commit to Ram81/habitat-web-sim that referenced this pull request Dec 10, 2020
…okresearch#150)

* Fix build without deprecated Magnum features.

* Reuse GPU-made depth buffer and unproject it during a read.

Instead of creating our own additional one. Main goal is to allow easier
integration of Magnum builtin shaders as they don't have any such
feature. Apart from that this gives a nice little speedup in color-only
rendering (0.2 ms) thanks to one clear less and one less render target
being written to; while the additional operations for depth rendering
aren't really slower in total.