New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gl_rasterizer: implement shadow map 2D/Cube - Image load/store version #3778

Merged
merged 1 commit into from Jun 22, 2018

Conversation

Projects
None yet
5 participants
@wwylele
Member

wwylele commented May 25, 2018

This is the other alternative to implement shadow map, mentioned in #3729.

The downside of this method so-far:

  • requires extension ARB_shader_image_load_store (and some other trivial extension). Many modern GPUs have this implemented tho. The overall support rate is 70% reported by @jroweboy
  • has more potentially overhead due to atomic read/write in the fragment shader. The largest performance impact I have been reported is from nightly 110%+ speed -> this branch 105% speed, running monster hunter gen/x/xx on Nvidia GPU.

The advantage of this method:

  • Accurate soft shadow emulation
  • No texture reinterpretation between RGBA8 and D24S8. The format reinterpretation and decoding is done natively by image load/store and shader program.
    • In #3729 the reinterpretation caused huge performance impact. To mitigate it an ugly workaround is added, which however still involves pixel downloading/uploading, and cause serious issue on most AMD and some Intel GPU
    • D24S8 shadow sampler still has an unexplained issue in AMD.
    • In general D24S8 is a danger area where GPU driver likely to make mistakes, because it is not a usual texture format - different GPU might split it into two textures and transfer data in a special (not well-tested) way. By avoiding D24S8, we get defined GPU behavior - either do it correctly, or do not render shadow at all (in case of unsupported extensions)
  • No change made in the rasterizer. Treat shadow map as RGBA8 in the same way as PICA does
    • However there is a small flaw. See the comment in the code

Note that this solution is parallel with #3729. If needed, #3729 can still be added as a fallback path.


In an extreme edge case this doesn't guarantee accurate emulation though. GPU is allowed to reorder primitives and the values rendered to the same pixel might have a different order from the intended order. Recall the PICA shadow map framebuffer combiner:

if (color == 0)
    depth = min(depth, fragment_depth);
else
    if (fragment_depth < depth)
         stencil = min(stencil, f(color));

which is almost order-neutral, if viewed as a double MIN blending. The order only matters when fragments with both color==0 and color!=0 are mixed in one primitive branch, which shouldn't happen in a normal program design: color!=0 is usually in a second pass after color==0 to "decorate" the soft shadow region. If we want to take care of this extreme case, another OpenGL extension, ARB_fragment_shader_interlock, is required. Unfortunately, it has a base OpenGL version requirement of 4.2, and it is not supported by many GPUs.


For anyone that finds the format reinterpretation/decoding confusing in this code, here is a simple explanation:

The native PICA RGBA8 format is {A, B, G, R} in bytes layout.
The native PICA shadow map format is {Dh, Dm, Dl, S} in bytes layout (D = depth, S = stencil, h = high byte, m = medium, l = low). Note that his is different from PICA D24S8 format ({Dl, Dm, Dh, S})
So we can say, in PICA, A=Dh, B=Dm, G=Dl, S=R

PICA RGBA8 textures is uploaded to OpenGL as RGBA8. This OpenGL texture is then bound to shader as a r32ui image. According to the format conversion rule of image load/store, the data is transferred as if the pixel is downloaded to client with the corresponding external format of the source format, then uploaded with the corresponding external format of the destination format. The "external format" here specified for RGBA8 is {GL_RGBA, GL_BYTE}, meaning that the downloaded pixel would be {R, G, B, A}={S, Dl, Dm, Dh} in bytes layout; the destination format is a u32 integer. Assuming little-endian platform, the destination value, viewed by the shader, would be ABGR = DhDmDlS in u32 layout (left = most significant byte). Therefore to decode the shadow map in the shader, one just needs to do Depth = DhDmDl = DhDmDlS >> 8; Stencil = DhDmDlS & 0xFF;

Granted, this relies on the assumption of little-endian platform. On big-endian platform, byte swapping needs to be done in the shader. We can leave this until we actually support big-endian platforms.


I didn't add the state tracking boilerplate of image binding in gl_state, because it is only used in the rasterizer main rendering, and it itself has too many state to track (format, readonly/writeonly...), which doesn't worth to do it.


This change is Reviewable

@cluezbot

This comment has been minimized.

cluezbot commented May 25, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-05-25T16:46:10Z: 1c03ff1...wwylele:57b31c7d36dc7c947731bb3287c4570b42b68c13

@wwylele wwylele force-pushed the wwylele:shadow-hw-image-load-store branch from 57b31c7 to ecf922b May 25, 2018

@cluezbot

This comment has been minimized.

cluezbot commented May 25, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-05-25T16:49:11Z: 1c03ff1...wwylele:ecf922b708e92ea4867ca8e73a6bf6ec79a6b04f

@wwylele wwylele force-pushed the wwylele:shadow-hw-image-load-store branch from ecf922b to dc288f5 May 25, 2018

@cluezbot

This comment has been minimized.

cluezbot commented May 25, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-05-25T17:17:16Z: 1c03ff1...wwylele:dc288f53f669b84e659ef7e1599ebb2008d8c85f

@wwylele wwylele force-pushed the wwylele:shadow-hw-image-load-store branch from dc288f5 to 71cb9c8 May 25, 2018

@cluezbot

This comment has been minimized.

cluezbot commented May 25, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-05-25T17:38:26Z: ba4a052...wwylele:71cb9c8617ca78fcc3062b80b3108346941f15c9

@tywald

This comment has been minimized.

tywald commented May 25, 2018

Comparison between Nightly-698 and this PR at 5x resolution with i7-3770K @ 4.3GHz and GTX 680 in MHGen:
Nightly-698: ~5.80ms
Appveyor(d3aa250): ~6.25ms

@wwylele wwylele force-pushed the wwylele:shadow-hw-image-load-store branch from 71cb9c8 to 4ad33ec May 26, 2018

@cluezbot

This comment has been minimized.

cluezbot commented May 26, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-05-26T12:01:49Z: 03209ef...wwylele:4ad33ec0ec5090a7c500b39e4641a8131c6697a6

@wwylele wwylele changed the title from gl_rasterizer: implement shadow map 2D - Image load/store version to gl_rasterizer: implement shadow map 2D/Cube - Image load/store version May 26, 2018

@wwylele

This comment has been minimized.

Member

wwylele commented May 26, 2018

I went ahead to implement shadow map cube as well, because I figured that putting it into another PR would rewrite the GLSL part again.

btw, CTRAging contains tests for soft shadow and shadow map cube, which run fine on this branch.

vec4 shadowTextureCube(vec2 uv, float w) {
vec3 c = vec3(uv, w);
vec3 a = abs(c);
if (a.x > a.y && a.x > a.z) {

This comment has been minimized.

@wwylele

wwylele May 26, 2018

Member

I do realize that the if here in GLSL would produce a large amount overhead on repeating SampleShadowCubeFace and discarding. The only way to resolve this is to use a real texture cube/array image. However, in the current form of our texture cache, it would require copying texture surface from 2D to cube/array all the time (just like how we do for normal texture cube), which I guess would cause even large overhead than this.

@wwylele wwylele force-pushed the wwylele:shadow-hw-image-load-store branch from 4ad33ec to 6b73207 May 26, 2018

@cluezbot

This comment has been minimized.

cluezbot commented May 26, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-05-26T21:47:53Z: 09982c3...wwylele:6b7320783cfb705c1e070471e81d2d9656ef45ef

@wwylele wwylele force-pushed the wwylele:shadow-hw-image-load-store branch from 6b73207 to be609bc May 28, 2018

@cluezbot

This comment has been minimized.

cluezbot commented May 28, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-05-28T20:03:53Z: 09982c3...wwylele:be609bc552a8580ad511ff33b0e88fc1b656cf85

@wwylele wwylele force-pushed the wwylele:shadow-hw-image-load-store branch from be609bc to cf02470 May 28, 2018

@cluezbot

This comment has been minimized.

cluezbot commented May 28, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-05-28T20:15:57Z: 65f5bc7...wwylele:cf02470e80d6a91c0d0e138b4d9d318c3f0272f5

@wwylele wwylele force-pushed the wwylele:shadow-hw-image-load-store branch from cf02470 to 08d7b59 May 31, 2018

@cluezbot

This comment has been minimized.

cluezbot commented May 31, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-05-31T22:42:32Z: 72f9142...wwylele:08d7b594a673d099f2b6b9308625baa538bda495

@wwylele wwylele force-pushed the wwylele:shadow-hw-image-load-store branch from 08d7b59 to 781912e Jun 1, 2018

@cluezbot

This comment has been minimized.

cluezbot commented Jun 1, 2018

Hi, this is neobot, using neobrain's account. I'm keeping an archive of versions of this PR:

2018-06-01T11:26:53Z: 08b1191...wwylele:781912e854863d6f565eab8b53b5b0869ab27ab9

@jroweboy

This comment has been minimized.

Member

jroweboy commented Jun 2, 2018

Report from a user on the forum who had issues with the old code says this new version fixes the performance regression: https://community.citra-emu.org/t/low-fps-in-a-good-pc/22662/10?u=jroweboy

return vec4(mix2(s, f));
}
vec4 shadowTextureCube(vec2 uv, float w) {

This comment has been minimized.

@degasus

degasus Jun 12, 2018

Contributor

I wonder why are you using images here instead of textureGather(usamplerCube). I see you need images to write to the shadow_buffer image, but the shadow_texture_* images can be by accessed with a common texture easier.

This comment has been minimized.

@wwylele

wwylele Jun 12, 2018

Member

I use image instead of texture mainly in order to reinterpret RGBA8 as U32. I know that I can achieve the same with texture view, but... don't really like to do more texture object management and don't want to manage more extensions.

texture cube / image cube is not used here because I don't find a way to attach 6 textures to one cube object = I have to either copy six faces every frame, or I have to rework the texture cache system.

@wwylele

This comment has been minimized.

Member

wwylele commented Jun 22, 2018

I am going to merge this if no more comments

@wwylele wwylele merged commit f50e505 into citra-emu:master Jun 22, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@wwylele wwylele deleted the wwylele:shadow-hw-image-load-store branch Jun 22, 2018

@MerryMage MerryMage referenced this pull request Jul 8, 2018

Closed

Progress Report 2018 Q2 #85

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment