Make PCF Loop Bounds Constant #22

Closed
CaffeineViking opened this issue Dec 28, 2018 · 1 comment
Assignees
Labels
Priority: Low Project: Renderer Issues relating to the core hair renderer itself. Type: Optimize Suggestion to optimize an specific algorithm. Type: Refactor Make this thing a bit less shitty, pretty please!

Comments

@CaffeineViking
Owner

CaffeineViking commented Dec 28, 2018

I've found that we can save 3-4ms on the strand self-shadows of the ADSM approach if the PCF kernel radius is known at shader compilation time. It looks like glslc unrolls the loop and can make more assumptions when the bounds are constant. I know for sure that this makes a difference on my laptop, but I still need to profile the benefits on the AMD machine (I've heard that performance can sometimes even decrease on GCN if the compiler unrolls loops too aggressively, so I'll profile to make sure this is worth it). For reference, on my laptop, in the worst case (i.e. hair occupies the entire screen), rendering the ponytail takes ~9.8ms. Without any shadows it takes 5.8ms, and without any shading at all it takes 4.8ms (an unavoidable cost of rasterizing 1.8M lines). So a performance breakdown gives us:

  • Rasterization: 4.8ms
  • Kajiya-Kay: 1ms
  • 3x3 PCF ADSM: 4ms
  • Total cost: 9.8ms

With constant loop bounds the entire pass instead takes ~6.8ms, which gives us these results:

  • Rasterization: 4.8ms
  • Kajiya-Kay: 1ms
  • 3x3 PCF ADSM: 1ms
  • Total cost: 6.8ms

For further reference, on the beefy AMD machine, the unoptimized version (the one that took 9.8ms on my laptop) takes around 1.5ms for the entire pass (i.e. total cost is 1.5ms). I haven't tested the optimized version there yet, but if we assume the speedup translates to AMD hardware as well, we'll go from 1.5ms to ~1ms. This would leave us more room to do other cool things (and we still need to leave time for the OIT), so I think this optimization is worth trying out.

I'll see if I can make the offending calculation go away, or just assume that we always use 3x3 kernels. Another way would be to make the PCF radius a specialization constant, and re-create the pipeline (so that the shader is specialized again) whenever the radius is modified.

@CaffeineViking CaffeineViking added Type: Refactor Make this thing a bit less shitty, pretty please! Project: Renderer Issues relating to the core hair renderer itself. Priority: Medium Type: Optimize Suggestion to optimize an specific algorithm. labels Dec 28, 2018
@CaffeineViking CaffeineViking self-assigned this Dec 28, 2018
@CaffeineViking
Owner Author

I tested this before, and the results were not as extreme on AMD hardware. Let's forget about this for now.
