
Reuse GPU-created depth buffer and unproject it during a read #150

Merged
merged 2 commits into master from reusing-depth-buffer on Aug 16, 2019

Conversation

mosra
Collaborator

@mosra mosra commented Aug 13, 2019

Motivation and Context

This is a first step towards the goal of integrating Magnum's builtin shaders -- they don't have any such feature, so this was a blocker that had to be removed first.

Apart from that, this gives a nice little speedup in color-only rendering (0.2 ms) thanks to one less clear and one less render target being written to, while the additional operations for depth rendering aren't really slower in total.

(benchmark screenshot)

How Has This Been Tested

Pytest passes (it was a pain, but yes, it does!), and CI passed the 1400 FPS milestone 🎉

@mosra mosra requested review from msavva and bigbike August 13, 2019 21:27
@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Aug 13, 2019
Collaborator

@msavva msavva left a comment

Thanks @mosra
LGTM 👍 The detailed docs for the unprojection logic are much appreciated.
Given that tests are also passing, feel free to land.

@msbaines
Contributor

With this patch, the reverse transformation of depth values is done on the CPU instead of the GPU. Pushing pixel processing to the CPU will have a noticeable effect on CPU utilization. I suspect CPU utilization is much higher than before. This also precludes the GPU-GPU transfer of depth data to CUDA and forces a GPU-CPU and a CPU-GPU copy compared to the ideal case of no copies.

@@ -54,10 +54,8 @@ class GenericShader : public Magnum::GL::AbstractShaderProgram {
enum : uint8_t {
Contributor

Surprised. I thought such an enum definition would require C++17, while we are using C++14 in this project.

@mosra:
Are these outputs also required to be "mutually exclusive" for each shader?

For textures, I introduced a new header file called TextureBindingPointIndex.h in #132 to define enumerators (binding points) so that texture units would not conflict. I would like to move the "textureLayer" defined in GenericShader.cpp to this file. Any comments?

Contributor

I think this enum syntax was available in C++11:

https://en.cppreference.com/w/cpp/language/enum

Collaborator Author

Yes, C++11 added two new things to enums -- strong enums (enum class) and typed enums (enum: type). Those two are completely orthogonal features, one can be used without the other.

Texture binding points -- see my other comment at #151 (comment)

@erikwijmans
Contributor

This also precludes the GPU-GPU transfer of depth data to CUDA and forces a GPU-CPU and a CPU-GPU copy compared to the ideal case of no copies.

For GPU-GPU, the unprojection can be done on the GPU via CUDA quite easily.

Contributor

@msbaines msbaines left a comment

LGTM. As Erik commented, my initial concerns about GPU-GPU will be addressed by doing the depth unprojection in CUDA.

*/
const Matrix4 projection = camera.getMagnumCamera().projectionMatrix();
const Matrix4 invertedProjection =
Matrix4::translation(Vector3::zAxis(-projection.translation().z())) *
Contributor

I think this is where you are chopping off the near-plane by moving the near-plane to z=0. How does it work? Why would projection.translation().z() give you the correct z-translation?

Collaborator Author

Hmm. 🤔 Good point, actually 👍

I won't lie -- without this, the output differed from the depth value previously written explicitly by the shader by about 0.02. The projection translation Z is also about 0.02, so this operation seemed to make sense at the time. That was before the tests were passing: I hadn't yet discovered the discrepancy between mesh holes being represented as 0 in the previous case but as 1 by the GPU depth buffer, and thus wasn't aware that this shift doesn't have much effect on whether the test passes. It passes both with and without it; the threshold is large enough to allow that.

Now, going through the actual matrix math, it seems totally arbitrary and just a lucky coincidence 😅 Furthermore, when removing this arbitrary shift, the resulting depth frame is much closer to the actual ground truth (I could reduce the threshold from 1.5e-2 to 0.3e-2 and it still passes). Removing this, then -- thanks a lot for the heads-up.

A remaining mystery is the 0.02 difference between values produced by the GPU depth buffer and by the explicit shader calculation -- but since I'm closer to the ground truth with the GPU depth buffer, I would assume it's due to some rounding errors happening at various places (and also the switch from 24-bit depth to 32F depth could have an effect on that).

Instead of creating our own additional one. Main goal is to allow easier
integration of Magnum builtin shaders as they don't have any such
feature. Apart from that this gives a nice little speedup in color-only
rendering (0.2 ms) thanks to one clear less and one less render target
being written to; while the additional operations for depth rendering
aren't really slower in total.
@mosra
Collaborator Author

mosra commented Aug 15, 2019

@erikwijmans @msbaines updated to address your comments about the 2x2 and suspicious near plane shift, thanks a lot for calling me out on my rusty math :)

@msavva for efficient GPU/GPU depth transfer, I think it's better to have the depth unprojection shader on the GL side (and not in CUDA), as it can share code with the depth-only shader. Do you think it would make sense to sketch that out as part of this PR? Or better separately? #114 depends on that, so it needs to be done one way or the other :)

@msavva
Collaborator

msavva commented Aug 15, 2019

@msavva for efficient GPU/GPU depth transfer, I think it's better to have the depth unprojection shader on the GL side (and not in CUDA), as it can share code with the depth-only shader. Do you think it would make sense to sketch that out as part of this PR? Or better separately? #114 depends on that, so it needs to be done one way or the other :)

Agreed that it makes more sense to do this on the GL side. I'm strongly in favor of smaller PRs 🙂 -- so I'd say that it's better to land this PR and create a separate one for the depth-only shader.

@mosra
Collaborator Author

mosra commented Aug 15, 2019

Great, so unless somebody is against it, I'll merge this and #151 tomorrow morning (late evening today for you). Then I'll open further PRs for the aforementioned shader cleanup and depth-only rendering / depth unprojection.


| a b | | z | | az + b |
| c d | * | 1 | = | cz + d |

Contributor

I think it may be possible to solve the perspective projection for z.

Given:

projectionMatrix[2].z = A
projectionMatrix[3].z = B
projectionMatrix[2].w = -1
projectionMatrix[3].w = 0

zperspective = | A  B |   | z |   | Az + B | = (Az + B)/-z
               | -1 0 | * | 1 | = |  -z    |

zperspective = 2.0 * depth - 1.0

Plugging into above:

2.0 * depth - 1.0 = (Az + B)/-z

Solving for z:

z = -B/(2.0 * depth + A - 1.0)

Negating (to get positive values):

z = B/(2.0 * depth + A - 1.0)

If addition is used for 2.0 * depth, only 2 additions and 1 divide are required.

Collaborator Author

Thank you!

Will apply this in a separate PR together with setting up C++-side testing -- I wanted to measure the perf difference :)

Collaborator Author

Update after benchmarking -- this makes the calculation slightly faster (thanks again), however most of the time (>75%) is consumed by the if() that flips values at the far plane to 0.0f :/ In fact, even doing the dumbest full 4x4 multiplication and extracting just Z afterwards is only a minor perf penalty compared to the cost of the if.

It gets a bit faster when done in a separate loop, however that makes it much harder to do the unprojection in-place.

Cc: @erikwijmans

Contributor

@erikwijmans erikwijmans Aug 19, 2019

Can you try a ternary instead of an if, i.e. ptr[i] = z == 1.0f ? 0.0f : <unprojected>? Jumps are muuuccch slower than conditional moves and the compute cost of getting the unprojected value is so trivial that computing it more often than needed shouldn't matter :)

Collaborator Author

No change, as expected -- I assume the optimizer would try that as a first thing.

Problem is that the ifs prevent code from being batched & vectorized. If I move the condition to a separate loop it gets three times faster, but still about 2x slower than a dumb 4x4 multiplication without the branch would be.

Anyway, will submit this together with the GPU version tomorrow. Maybe you figure out an improvement again :)

Contributor

@erikwijmans erikwijmans Aug 19, 2019

Ah, yeah. The ternary won't help with the CPU version, but I believe it'll help with the OpenGL version (it 100% would with CUDA version, if's are terrible on CUDA).

@mosra mosra merged commit 323bd6b into master Aug 16, 2019
@mosra mosra deleted the reusing-depth-buffer branch August 19, 2019 13:28
eundersander pushed a commit to eundersander/habitat-sim that referenced this pull request Aug 6, 2020
* episode stats changed to dict of dicts

* handle non-unique episode_id

* use tuple as key

* update while loop checking

Co-Authored-By: Erik Wijmans <ewijmans2@gmail.com>
Ram81 pushed a commit to Ram81/habitat-web-sim that referenced this pull request Dec 10, 2020
…okresearch#150)

* Fix build without deprecated Magnum features.

* Reuse GPU-made depth buffer and unproject it during a read.

Instead of creating our own additional one. Main goal is to allow easier
integration of Magnum builtin shaders as they don't have any such
feature. Apart from that this gives a nice little speedup in color-only
rendering (0.2 ms) thanks to one clear less and one less render target
being written to; while the additional operations for depth rendering
aren't really slower in total.