ClusteredDeferred Example and Clustering Extension #216
Further ideas for cluster light histogram generation, via raster pipeline abuse:
On mobile (or when tessellation and geometry shaders are not available), a compute shader can be used to emulate VS+TS+GS at a minor cost. When the stencil limit is too low (non-hierarchical approaches, or because the raster-based prefixing is overly conservative), a single-attachment, depth-less R16_UNORM FBO could be used with blending instead.
Even more ridiculous idea: use stencil-routed K-buffer order-independent transparency to keep a list of the N apparently strongest lights per cluster.
Idea for culling non-contributing lights:
Notes: because the cluster mask is used as a storage image, Hi-Z with a depth format will be impossible to leverage.
Need conservative voxelization for tagging clusters with transparents (and it would be nice to test them against Hi-Z and the cluster mask as well).
Summing up:
Knowing the state of the art
Check ddx/ddy in the shader and atomically tag a range of clusters along Z (and lose the ability to have early cluster tests), OR don't Z-test and just reproject with a geometry shader (CS for mobile).
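A minimal sketch of the "tag a range of clusters along Z" idea, written as CPU-side C++ rather than shader code. The names `SliceRange` and `fragmentSliceRange` are hypothetical, and the uniform (linear) slicing of a [0,1] depth range is an assumption made only to keep the example short; the atomic tagging itself is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper type: inclusive range of Z-slices touched by one fragment.
struct SliceRange { uint32_t first; uint32_t last; };

// depth      : fragment depth in [0,1]
// dzdx, dzdy : screen-space depth derivatives (what ddx/ddy would return)
// sliceCount : number of Z-partitions
// Assumption: slices divide the [0,1] depth range uniformly.
inline SliceRange fragmentSliceRange(float depth, float dzdx, float dzdy, uint32_t sliceCount)
{
    assert(sliceCount > 0u);
    // Across one pixel footprint, depth deviates from the centre value by at
    // most half the sum of the absolute derivatives.
    const float halfExtent = 0.5f * (std::fabs(dzdx) + std::fabs(dzdy));
    const float zMin = std::max(0.0f, std::min(1.0f, depth - halfExtent));
    const float zMax = std::max(0.0f, std::min(1.0f, depth + halfExtent));
    const auto toSlice = [sliceCount](float z) {
        const uint32_t slice = static_cast<uint32_t>(z * static_cast<float>(sliceCount));
        return std::min(slice, sliceCount - 1u); // depth == 1.0 maps to the last slice
    };
    return { toSlice(zMin), toSlice(zMax) };
}
```

Every slice in `[first, last]` would then be tagged atomically in the cluster mask, which is exactly why the early cluster test is lost: the range is only known once the fragment shader runs.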
An MSAA approximation of conservative rasterization, or a compute shader, is the only alternative.
Don't reproject triangles, or else lose the Z-test. Even if we could launch 1 fragment per cluster intersected by a triangle, we couldn't leverage early-Z or early-stencil for culling against ACTIVE clusters without broadcasting triangles to different FBO layers (expensive). So there's no point in forcing the Stencil Mask to be the depth or stencil component.
Semi-conclusion / alternative:
Nvidia hardware cannot do ARB_shader_stencil_export, which implies that the Cluster Mask bitfield cannot be in the stencil component (because different triangle pixels would need different stencil values depending on their z-value). However, Cluster Mask in the depth component would mean no Z-tests at all, because the pixel z-value is actually a Z-axis bin-mask now and there's no hardware z-comparison function that would actually work with that. We also need the Z-test for the Hi-Z cull against the framebuffer.
Conclusion: Either broadcast your triangles to layers and pay at least 8x more memory for hardware early per-pixel tests, as well as the perf decrease from chopping up your triangles, or give up on the hope of transparents culling themselves out, or use atomics on storage images.
But not all is lost, since now we have a free stencil component that we could use for this.
Additional note: If we wanted to leverage early-Z cull on voxelization, the Z-buffer would have to be copied multiple times into the layers, and that would be a fail.
There are only 2 approaches to culling the transparent triangle voxelization:
1. Only cull against scene Hi-Z. Associate color attachment bits to clusters along the Z-axis; obviously the storage required per Z-line gets rounded up to the nearest 4 bytes.
2. Cull against Hi-Z and self. Associate color attachment bits to clusters along the Z-axis; obviously the storage required per Z-line gets rounded up to the nearest 4 bytes.
Conclusion: The second approach's culling is redundant (you have to do the calculation anyway to figure out the coverage). Use color attachment bits for cluster active masks (max 128 Z-partitions), don't choose a dominant axis (no need with bitfield voxels), and do conservative-raster triangle expansion (with culling and filtering first) via CS or geometry shader instead of the MSAA approach.
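To make the bitfield idea concrete, here is a small sketch of a per-Z-line active mask held in color attachment bits. It assumes the 128 Z-partition ceiling corresponds to four 32-bit channels (e.g. an RGBA32UI-style attachment), which the text above does not state explicitly; `ZLineMask`, `bytesPerZLine` and `rangeMask` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One Z-line of cluster "active" bits. Assumption: four 32-bit channels,
// which gives the 128 Z-partition maximum.
using ZLineMask = std::array<uint32_t, 4>;

// Storage per Z-line, rounded up to the nearest 4 bytes (one 32-bit word).
inline uint32_t bytesPerZLine(uint32_t zPartitions)
{
    assert(zPartitions >= 1u && zPartitions <= 128u);
    return ((zPartitions + 31u) / 32u) * 4u;
}

// Build a mask with bits [first, last] (inclusive) set, one bit per Z-partition.
inline ZLineMask rangeMask(uint32_t first, uint32_t last)
{
    assert(first <= last && last < 128u);
    ZLineMask mask{}; // value-initialised: all bits cleared
    for (uint32_t word = 0u; word < 4u; ++word)
    {
        const uint32_t lo = word * 32u;
        const uint32_t hi = lo + 31u;
        if (last < lo || first > hi)
            continue; // the range does not touch this channel
        const uint32_t b0 = first > lo ? first - lo : 0u;
        const uint32_t b1 = last < hi ? last - lo : 31u;
        const uint32_t width = b1 - b0 + 1u;
        // Avoid the undefined 32-bit shift when the whole word is covered.
        const uint32_t bits = width == 32u ? 0xFFFFFFFFu : ((1u << width) - 1u);
        mask[word] = bits << b0;
    }
    return mask;
}
```

A triangle covering Z-partitions 30 to 33 would, under these assumptions, produce `0xC0000000` in the first channel and `0x3` in the second, which could then be OR-ed into the attachment.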
Result the Light Clustering Algorithm Tries To Achieve
A per-cluster count of probably visible light sources. For this reason, if using rasterization for voxelization, the light volumes need to be created via layered rendering and triangle broadcast (duplication) by CS or TS.
In-Cluster Culling
Live/Dead stencil tagging cannot use 2 values to specify a "live" range (impractical for transparent voxelization), so keep the Hi-Z buffer around for both transparent tagging and light voxelization.
Light Pre-Culling
Before a light is voxelized into the cluster grid, the light list can be frustum culled and Hi-Z culled, both to discard a light entirely and to trim its Z-range (useful for lowering the broadcast ratio), all in the compute or tessellation control shader.
Light Voxelization
It is desirable to spawn 1 thread per light per cluster for non-divergence and ideal parallelism. This is why the TC or CS light preprocessing stage should not attempt to construct triangle approximations for the light volume cross-section (one thread per triangle vertex is the optimum in either the evaluation or the vertex shader). Also, there should be no geometry-shader Hi-Z per-triangle culling (not enough clusters per average triangle to justify it), unless it somehow happens in the CS/TC stage. For seamless/gapless rendering, indexed meshes shall be used to draw the cross-sections, with a square or a pentagon used to approximate the conical section. Given the highly symmetric elliptical conical section, though, a quadrilateral approximation is preferred over a pentagon or even a hexagon.
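The Light Pre-Culling step could look roughly like the sketch below, which only handles the Z-axis part: reject a light whose bounding sphere lies entirely outside the near/far range or behind the farthest opaque depth, and otherwise trim its slice range. Everything here is an assumption for illustration: the names `LightZRange` and `trimLightZRange`, the single `hiZFarthest` value standing in for a Hi-Z lookup over the light's screen footprint, and the linear view-space slicing. Side-plane frustum culling is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type: whether the light survives and which slices it spans.
struct LightZRange { bool visible; uint32_t firstSlice; uint32_t lastSlice; };

// lightZ      : view-space distance of the light centre along the view axis
// radius      : radius of the light's bounding sphere
// zNear, zFar : view-space clip distances
// hiZFarthest : farthest opaque depth under the light's footprint (stand-in for a Hi-Z fetch)
// Assumption: Z-slices divide [zNear, zFar] linearly.
inline LightZRange trimLightZRange(float lightZ, float radius, float zNear, float zFar,
                                   float hiZFarthest, uint32_t sliceCount)
{
    assert(sliceCount > 0u && zNear < zFar);
    // Nothing behind the farthest opaque surface can receive visible lighting.
    const float farLimit = std::min(zFar, hiZFarthest);
    const float zMin = std::max(lightZ - radius, zNear);
    const float zMax = std::min(lightZ + radius, farLimit);
    if (zMin > zMax)
        return { false, 0u, 0u }; // discard the light entirely
    const float scale = static_cast<float>(sliceCount) / (zFar - zNear);
    const auto toSlice = [=](float z) {
        const uint32_t slice = static_cast<uint32_t>((z - zNear) * scale);
        return std::min(slice, sliceCount - 1u);
    };
    return { true, toSlice(zMin), toSlice(zMax) };
}
```

The number of layers a light then has to be broadcast to is `lastSlice - firstSlice + 1`, so a tighter Hi-Z bound directly lowers the broadcast ratio.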
Light Clustering Schemes
Regular Grid
Results in heavy memory usage with lights that cover multiple clusters (as each cluster needs a copy of the light index).
MipMap Pyramid In Frustum
Based on the assumption that light volumes will have roughly similar shapes, so the same lights near the camera will occupy many clusters (almost all) in the first few Z-partitions. This method does not introduce any more divergence in the shading (the only divergent part is the mip-map coordinate to sample from).
Octree
This builds on the regular grid approach, storing a full octree in the mip-maps of the grid texture. This approach would likely introduce some divergence, but only in the light fetching, not in the lighting calculations themselves. However, the real challenge is how to efficiently construct said octree, because:
Is it possible to build an octree efficiently using less memory bandwidth than for the above?
Pre-requisites are done
no need anymore.
Naturally, in irr::ext::ClusteredLighting the extension will not actually provide the deferred shading (so it can be used modularly for Forward+ as well).
Base off:
http://www.humus.name/Articles/PracticalClusteredShading.pdf
Pre-requisites:
Few changes to Avalanche method:
Pseudocode for CPU light clustering:
Algorithm outline for GPU:
NOTE: Could abuse the vertex/pixel pipeline with an empty framebuffer here, where the first pass would be a vertex shader producing a quad covering threadsToSpawnForAdd pixels, and the second-pass compute shader would become a pixel shader (this is probably the better solution).
NOTE2: It's possible to separate spotlights from point lights quite easily, but that would require virtually splitting each cluster into 2.
There is one more possible investigation for the future with the use of warp-intrinsics/shared memory, the N-ary search using all warps together... maybe a 4D skip-list is not necessary.