
Early Fragment Tests, Hi Z, Depth, Stencil and other benchmarks


Modifying gl_FragDepth

Here are example benchmarks on an Nvidia 1060 with a texture-sampler-heavy shader (1000 samples from a 2D texture per pixel):

  1. 17 FPS: no tests, no depth attachment
  2. 60 FPS: depth test enabled
  3. 12 FPS: depth test enabled and writing gl_FragDepth = gl_FragCoord.z; in the shader
  4. 60 FPS: gl_FragDepth = gl_FragCoord.z; with #extension GL_ARB_conservative_depth : enable and layout (depth_unchanged) out float gl_FragDepth;
  5. 60 FPS: gl_FragDepth = gl_FragCoord.z; with layout(early_fragment_tests) in;
  6. 14 FPS: gl_FragDepth = gl_FragCoord.z; with #extension GL_ARB_conservative_depth : enable and layout (depth_less) out float gl_FragDepth;
  7. 14 FPS: gl_FragDepth = gl_FragCoord.z - 0.00001; with #extension GL_ARB_conservative_depth : enable and layout (depth_less) out float gl_FragDepth;
  8. 60 FPS: gl_FragDepth = gl_FragCoord.z - 0.00001; with #extension GL_ARB_conservative_depth : enable and layout (depth_unchanged) out float gl_FragDepth;

It's surprising that (6) and (7) do not work, as they are consistent with the direction of the depth buffer (it's a reverse depth buffer, so the depth func is GEQUAL; I even tested with just GREATER and got the same results). Obviously (8) produces visually incorrect results.
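
For the record, the declarations behind cases (4) and (8) look roughly like this; a minimal sketch, with the 1000-sample texture loop omitted and the output name made up:

```glsl
#version 330 core
#extension GL_ARB_conservative_depth : enable

// Promise that the shader leaves depth as-is: cases (4) and (8).
// Swapping depth_unchanged for depth_less gives the cases (6)/(7) variant, which
// with a reverse-Z buffer and GEQUAL ought to be an equally valid promise.
layout (depth_unchanged) out float gl_FragDepth;

layout (location = 0) out vec4 outColor;

void main()
{
    // ...the 1000-sample texture loop from the benchmark goes here...
    outColor = vec4(1.0);
    gl_FragDepth = gl_FragCoord.z; // cases (7) and (8) write gl_FragCoord.z - 0.00001 instead
}
```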

Later tests (masking depth vs. stencil) revealed that sometimes NVidia violates the ARB_conservative_depth spec:

The layout qualifier for gl_FragDepth specifies constraints on the final value of gl_FragDepth written by any shader invocation. GL implementations may perform optimizations assuming that the depth test fails (or passes) for a given fragment if all values of gl_FragDepth consistent with the layout qualifier would fail (or pass). If the final value of gl_FragDepth is inconsistent with its layout qualifier, the result of the depth test for the corresponding fragment is undefined. However, no error will be generated in this case. When the depth test passes and depth writes are enabled, the value written to the depth buffer is always the value of gl_FragDepth, whether or not it is consistent with the layout qualifier.

When the layout qualifier is <depth_unchanged>, the shader compiler will honor any modification to gl_FragDepth, but the rest of the GL assume that gl_FragDepth is not assigned a new value.

NVIDIA, I expected better of you...

But Nvidia fucked up implementing ARB_conservative_depth fully in both OpenGL and DirectX! Here's a guy that had the same issue with a reverse depth buffer under DX. Which is really funny, because it's basically a static shader-analysis CPU-side check: which way does the shader modify the depth, and what depth func has been set. I guess NVidia only optimizes when the depth func is GL_LEQUAL or GL_LESS and the layout is depth_greater, if at all.

Conclusion

If you want to do something like a displacement mapping shader which moves pixels deeper into the framebuffer, then lie in your layout qualifier and say that the depth is unchanged.

early_fragment_tests is not an option, as the depth gets written to the Z-Buffer before the fragment shader executes (basically as soon as the test is done).
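
Here's what the lie looks like in practice; just a sketch with made-up names (vUV, heightMap), and per the spec quote above the depth test result is technically undefined once the promise is broken:

```glsl
#version 330 core
#extension GL_ARB_conservative_depth : enable

// The displacement pushes fragments deeper (smaller Z with a reverse-Z buffer),
// which would honestly be depth_less, but that was the slow path in cases (6)/(7),
// so we claim depth_unchanged instead.
layout (depth_unchanged) out float gl_FragDepth;

in vec2 vUV;                 // hypothetical texcoord from the vertex shader
uniform sampler2D heightMap; // hypothetical displacement source
layout (location = 0) out vec4 outColor;

void main()
{
    float displacement = texture(heightMap, vUV).r * 0.001;
    outColor = vec4(1.0);
    // When the test passes, the value written to the depth buffer is still the
    // one below; the lie only affects what the early/Hi-Z stages may assume.
    gl_FragDepth = gl_FragCoord.z - displacement;
}
```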

Using 'discard'

Test 1 - setup is the same as previous:

  1. 60 FPS base speed
  2. 24 FPS for using discard on half of the screen (see the sketch after this list)
  3. 13 FPS for discarding one pixel
  4. 85 FPS for early_fragment_tests (the problem is that the colour gets discarded, but depth gets written)
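
For clarity, cases (2) and (3) boil down to a fragment shader along these lines (the resolution and discard condition are just illustrative):

```glsl
#version 330 core

layout (location = 0) out vec4 outColor;

void main()
{
    // Case (2): throw away every fragment on the left half of an assumed
    // 1920-wide target; case (3) is the same idea with a condition that only
    // ever hits a single pixel.
    if (gl_FragCoord.x < 960.0)
        discard;

    // ...the 1000-sample texture loop from the benchmark goes here...
    outColor = vec4(1.0);
}
```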

Test 2 - clear depth buffer to 0.001 (10 units away from the camera):

  1. 210 FPS base speed
  2. 245 FPS for using discard on half of the screen
  3. 210 FPS to discard one pixel
  4. 210 FPS for early_fragment_tests
  5. 210 FPS for conservative depth

Conclusion

If you don't need to write to depth, then use early_fragment_tests.
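
Something along these lines; a sketch assuming depth writes are also masked off GL-side with glDepthMask(GL_FALSE), since as case (4) of Test 1 shows, early tests will otherwise store the depth of fragments you then discard:

```glsl
#version 420 core

// Force depth (and stencil) to be tested before this shader runs, even though
// it uses discard; only safe here because depth writes are off (glDepthMask(GL_FALSE)),
// otherwise discarded fragments would still leave their depth behind.
layout (early_fragment_tests) in;

layout (location = 0) out vec4 outColor;

void main()
{
    if (gl_FragCoord.x < 960.0) // stand-in for a real alpha/coverage test
        discard;
    outColor = vec4(1.0);
}
```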

The good news is that 'discard' seems to not disable the early-Z test, i.e. the group of objects being drawn with the shader gets early fragment tests against some previous version of the Z-Buffer, which is not being updated by the group.

Educated Guess: It seems that Hi-Z & Early-Z are built as some sort of block cache (a quadtree or something) which stores MaxZ or MinZ depending on the comparison function, so a cache block can only be confidently overwritten by a triangle that covers the entire block and has already passed the depth test (getting the min/max depth range of a single barycentrically interpolated triangle is a fast op).
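
A GLSL-flavoured sketch of that guess, purely a mental model and not real driver or hardware code:

```glsl
// Reverse-Z with GEQUAL: "closer" means a larger depth value, so the guess is
// that the per-tile cache keeps the minimum depth already written to the tile.
bool hiZCanRejectTile(float tileMinStoredZ, float triangleMaxZOverTile)
{
    // If even the closest point of the incoming triangle inside this tile is
    // farther away than everything already stored, the whole tile can be skipped.
    return triangleMaxZOverTile < tileMinStoredZ;
}

// The cached value can only be tightened confidently when a triangle that passed
// the depth test covers the entire tile, because its min/max depth over the tile
// falls out of barycentric interpolation almost for free; a triangle full of
// 'discard' holes gives no such cheap bound, so the tile entry is left alone.
```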

From AMD documentation: Since it only stores one of the values, changing the comparison direction in the middle of a frame makes the existing values in the buffer unusable. Thus HiZ will have to be disabled until the buffer is either cleared or the application changes the comparison direction back to the original. Typically the driver will configure the buffer depending on the first depth test function used after the depth buffer is cleared.

So basically the assumption is that you start off with a HiZ buffer which will 'fill up', leaving you less and less space in the Z-direction for unoccluded pixels; hence some updates can be skipped or omitted, as it's a conservative estimate.

So maybe the reason why it still kinda works with 'discard' shaders is that the HiZ is hardwired into the rasterizer, and the more fine grained single-pixel-resolution early-Z is not. Especially given that:

The Radeon HD 2000 series chips can use HiZ for both depth and stencil. However, for older hardware, HiZ only operates on depth values. For hardware prior to the Radeon HD 2000 series, all fragment rejection based on stencil values is on a per pixel granularity, and is performed in the Early Z stage which is later in the pipe.

Later in the same document it states:

but there are cases where the stencil operation may interfere with HiZ causing it to be disabled. If either of z-fail or stencil-fail operations is not KEEP, HiZ gets disabled. This is because it can’t reject tiles if the stencil is to be updated for any pixel in the rejected tiles. This is the case with z-fail, and it could be the case for stencil-fail (depending on the outcome of the stencil test). The stencil-pass operation doesn’t interfere with HiZ though since stencil only passes if the depth test also passes, meaning that if a tile is rejected based on the depth test we’re guaranteed that the stencil-pass operation won’t get executed for any fragment in the tile.

Which supports my theory: if the HiZ stores the max depth for a block (or the min depth in our reverse-Z case), then it would be very hard to compute a new conservative max or min depth for a new triangle with holes made by 'discard' without actually doing a reduction over the newly produced values (expensive). Hence it would make no sense to update the HiZ buffer, and the best we can do is stop updating it, yet still test against it.

And in conclusion, reversing the depth comparison function mid-frame is performance suicide.

Modifying gl_FragDepth Round 2

I tried the Test 2 setup from the 'discard' test; some of the cases change for the better:

  1. [SAME] 😄
  2. 210 FPS: depth test enabled
  3. 210 FPS [BETTER - Changed!] (for some reason the shader compiler knows you're not changing the depth and is doing an early test but decides not to update the cache)
  4. [SAME]
  5. [SAME]
  6. [WORSE]
  7. [WORSE]
  8. [SAME]

Masking out stuff at a certain depth (like particles)

According to this guy 'discard' will also mess with the early stencil test 😢

To make the test fair, I disabled ZWrite on the meshes being drawn, so they don't cull themselves and give an unfair advantage to the Z-Buffer method.

  1. 57 FPS: base perf without any masking
  2. 90 FPS: drawing a screenquad at depth 1.0 over half of the screen and using the depth test
  3. 89 FPS: drawing a screenquad over half of the screen, writing a specified depth (1.0) to gl_FragDepth, and using the depth test
  4. 89 FPS: drawing a screenquad over half the screen to set the stencil to 1 without depth or stencil tests, then using only the stencil test
  5. 89 FPS: drawing a screenquad over all of the screen, using 'discard' to not overwrite the other half, setting the stencil to 1, then using the stencil test (see the sketch after this list)
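
A sketch of the masking pass in case (5); the stencil write itself is GL state, so it only shows up as comments (glStencilFunc/glStencilOp are the standard calls, everything else here is illustrative):

```glsl
#version 330 core

// Masking pass, drawn as a fullscreen quad with colour and depth writes masked off.
// Assumed GL-side state: glStencilFunc(GL_ALWAYS, 1, 0xFF) and
// glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE), so every surviving fragment stamps
// a 1 into the stencil buffer while the discarded half stays at 0.
layout (location = 0) out vec4 outColor;

void main()
{
    if (gl_FragCoord.x < 960.0) // keep only half of the screen, as in the test
        discard;
    outColor = vec4(0.0);       // colour writes are masked off anyway
}
```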

Out of curiosity, I enabled ARB_conservative_depth in test (2), set the layout qualifier to depth_unchanged, and wrote a depth of 0.0 (which in reverse depth equals the far plane); the value being written to gl_FragDepth was not honoured.

Conclusion: It seems that Early-Z & Hi-Z get disabled only for the current shader, and can be restored without rebinding the FBO or clearing the Z-Buffer, which is nice considering that when I started programming GPUs (Nvidia 9600) a single discard or depth write would disable Early-Z for the remainder of the frame.

Other Notes

Always clear Depth and Stencil Buffer together

All ATI hardware stores depth and stencil information in the same buffer so for best performance it is essential that both depth and stencil be cleared together. Otherwise a slower read/modify/write operation will occur, like for instance only clearing the depth part of the buffer and leaving the stencil buffer untouched.

Early-Z could still be enabled with 'discard' as long as depth-write mask is disabled

According to the AMD/ATI document I was quoting from throughout this wiki page, that will definitely happen on Radeon hardware. As for Nvidia, I have not tested.

'discard' also fucks with Early-Stencil tests

But the question is whether it also fucks with them when the stencil write mask is disabled (the analogous case to the depth situation).

You can still use a single quad for Stencil-tested lights in a vanilla deferred renderer

Basically you can still write to 'gl_FragDepth' if depth write is disabled with conservative_depth.

BIG QUESTION: Nvidia, AMD... Why can't I have Early-Z, discard and depth-write?

It seems that I can test and output a different value using ARB_conservative_depth, so the logical capability is definitely there!

The only explanation is that they have a hardware atomic combined compareXStoreY(comparedepth, storedepth)-like function to update depth, and they cannot test comparedepth separately from writing storedepth.
