Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VideoCommon: Handle emboss texgen with only a single normal #10584

Merged
merged 8 commits into from Apr 23, 2022

Conversation

Pokechu22
Copy link
Contributor

Replacement to #10317. This time, the result is much closer to hardware.

@phire linked me to this 2002 article from a Factor 5 employee, which covers a lot about how they handle shading in the Rogue Squadron 2. Specifically, the second paragraph of the "Landscape Shader Optimizations" section says this:

A trick that is worth mentioning is how to avoid sending the same bi-normals and tangents for emboss mapping repeatedly to the transform unit (XF) of the graphics processor. It turns out that if these vectors are not present in the vertex format, XF will provide the previously transformed bi-normal and tangent, which reside in internal registers. Thus, if a dummy triangle is drawn with the bi-normal and tangent immediately before the landscape is drawn, then there is no need to send the same vectors over again for the rest of the height-map triangles. This means that only one vertex format is needed for the entire landscape, and it saves memory, transfer bandwidth and most importantly transform performance.

This is somewhat awkward to implement, but it isn't much more complicated than z-freeze (though the current implementation might be updating a vertex shader constant more often than it needs to be, resulting in a performance hit, as I didn't see a fast way of checking if an emboss texgen is in use without 3 normal vectors).

I have notes in this gist, though they are not particularly well-organized.

Here are some screenshots (using the hardware fifoplayer):

RS2 - dolphin with this PR

1_y0_x0
2

RS2 - real hardware

1_y0_x0
2

RS3 - dolphin with this PR

1_y0_x0
2

RS3 - real hardware

3_y0_x0
4

RS2 looks pretty much perfect to me, but RS3 has some lingering issues (e.g. the right foot of the nearest AT-AT is wrong). (Also, the sky is a different blue, and as phire mentioned luke's clothing (most obvious with the belt) doesn't have the proper shinyness.)

This PR also includes a few zfreeze-related adjustments, because I'm already working in that area.

I updated both the ARM and x64 vertex loaders, but I haven't tested the ARM vertex loader myself. Given that this is generated assembly, it probably won't work right.

@phire
Copy link
Member

phire commented Apr 14, 2022

Well the terrain looks pretty close to perfect in both RS2 and RS3.

But there are major issues with the AT-AT. When you zoom in, all 4 feet are wrong, it's just the back-right one is more wrong than the others. Dolphin always has much less emboss than real hardware, Especially with the horizontal ring around the foot.
The second-closest AT-AT has the same issues. At least the embossing is in the right direction.

I'm wondering if perhaps the binormals get cached before being multiplied with the normal matrix? That seems to produce a result closer to the real hardware.

BTW, This is a debug image I made a few days ago by editing a few shaders:

image

What it shows is that the back-rear foot of the AT-AT has a very different normal matrix to the other three feet. That might be why is it shows up so differently to the other foot in dolphin.

Source/Core/VideoCommon/VertexLoaderX64.cpp Outdated Show resolved Hide resolved
Source/Core/VideoCommon/VertexLoaderX64.cpp Outdated Show resolved Hide resolved
@Pokechu22 Pokechu22 force-pushed the emboss-single-normal-v2 branch 2 times, most recently from 20329a7 to 384097e Compare April 14, 2022 22:20
@Pokechu22
Copy link
Contributor Author

I'm wondering if perhaps the binormals get cached before being multiplied with the normal matrix? That seems to produce a result closer to the real hardware.

I tried that out, and it gives correct results! It doesn't exactly match what Factor 5 wrote in the article, but maybe they only discovered this behavior when working on RS3 (and thus they only change the normal matrix in RS3 and not in RS2)? I've updated my notes with some details and images from the hardware fifoplayer that helped me confirm this (basically, I manually disabled the normal matrix changes for the nearest AT-AT and the embossing changed in addition to the lighting).

Here's some new images of it:

RS3 - Dolphin

1_y0_x0
2
3_y0_x0
4

RS3 - hardware

1_y0_x0
2
5_y0_x0
6

@Pokechu22 Pokechu22 force-pushed the emboss-single-normal-v2 branch 3 times, most recently from 7ee2d8e to 067adad Compare April 14, 2022 22:47
@phire
Copy link
Member

phire commented Apr 14, 2022

I tried that out, and it gives correct results!

Comparing your screenshots, I agree. This produces a very close match.

It doesn't exactly match what Factor 5 wrote in the article, but maybe they only discovered this behavior when working on RS3 (and thus they only change the normal matrix in RS3 and not in RS2)?

I'm not even sure what effect they are expecting here. It seems to be simply applying a uniform level of emboss across the whole object. It doesn't respond to light on a per-vertex basis (apart from the fact that TEV is configured to only show the emboss result on lit surfaces). I feel like they could have gotten better performance by calculating the emboss offset in software, skipping the texgen and applying it directly in TEV. Or baking the emboss into a texture into a texture and skipping a TEV stage.

Maybe they did it this way just so the emboss direction would respond to the light direction?

@Pokechu22 Pokechu22 marked this pull request as ready for review April 15, 2022 00:42
@JosJuice
Copy link
Member

I don't have the knowledge needed to understand the PR overall, but I've tested and reviewed the AArch64-specific parts and they seem fine.

Copy link
Member

@phire phire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Comment on lines +98 to +99
MultiplyVec3Mat33(src->normal[1], mat, dst->normal[1]);
MultiplyVec3Mat33(src->normal[2], mat, dst->normal[2]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It bugs me that this is always wasting resources transforming the binomials when they might not be used. Especially when the transformed binormals are only used by emboss map.
Makes me ponder about what might be happening on the real hardware.

I really doubt it's looking ahead to see if there going to be an emboss map texgen or not.

Maybe XF can transform these binormals in parallel with earlier texgens on unused resources? Or maybe it actually applies the normal transform when it actually gets used in the emboss map texgen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be interesting to test what happens when using a regular texgen with the source row as BinormalT or BinormalN. Only the regular texgen allows for use of one or two texture coordinate matrices, so maybe the time that would be spent for that is used to transform the normals otherwise? (I'm not sure what would be happening with Color0 or Color1 though, as I'd assume the color would already be transformed by that point so I don't know what extra work it would be doing)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well when regular texgen use binormals as source, they get the raw binormals without any transform.

It's also worth noting that diagrams in the patents show separate functional units for applying position/normal transforms (dot product unit) and texgen's texture coordinate matrix operations (texture dot 2; Do texture coordinate matrix operations happen in fixed point?), so as long as there are a few texgen operations before any emboss map operations, the binomals transform would have time to complete in parallel

(And then lighting is done in a completely separate pipeline with it's own functional units)

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also odd that the article says:

Finally, emboss mapping (as bump mapping) needs binormals to describe the orientation of the height field on the surface. Since they need to be transformed the same way the normals are transformed this can add a bit overhead.

and also

This means that only one vertex format is needed for the entire landscape, and it saves memory, transfer bandwidth and most importantly transform performance.

That implies that it is faster (and they could have measured the performance improvement with GX_PERF0_XF_XFRM_CLKS, and it's likely they did). The article doesn't mention changing the normal matrix, though, and says that "XF will provide the previously transformed bi-normal and tangent, which reside in internal registers", which is inconsistent with what we've determined. (The article was written for RS2, which doesn't change the normal matrix while using this trick (that I've seen), so this may just have been their understanding of how it worked at the time.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really suspect faulty-recall here. That they forgot exactly what they were optimising for between when they did the optimisation and when they wrote the article.

It doesn't seem like they were actually optimising for GX_PERF0_XF_XFRM_CLKS. If I'm understanding the graphical effect correctly, the amount of emboss is identical across the entire frame.
It feels like there should have been multiple ways to achieve the same effect that would have been cheaper for both XF and TEV performance, at the expense of some per-frame overhead. For example, baking the summed emboss effect into a texture each frame would have reduced the number of texture lookups by one, potentially eliminated a TEV stage and or or two XF texgen channels.

With the fact that we know the transform gets applied anyway, and with the wild overlapping of color channels, it really feels like they are optimising solely for memory bandwidth. Both CPU -> RAM bandwidth, and RAM -> CP bandwidth.

Source/Core/VideoCommon/VertexShaderGen.cpp Outdated Show resolved Hide resolved
Comment on lines 231 to 242
if (!vdec.normals[1].enable)
{
m_vertex.normal[1][0] = VertexShaderManager::constants.cached_binormal[0];
m_vertex.normal[1][1] = VertexShaderManager::constants.cached_binormal[1];
m_vertex.normal[1][2] = VertexShaderManager::constants.cached_binormal[2];
}
if (!vdec.normals[2].enable)
{
m_vertex.normal[2][0] = VertexShaderManager::constants.cached_tangent[0];
m_vertex.normal[2][1] = VertexShaderManager::constants.cached_tangent[1];
m_vertex.normal[2][2] = VertexShaderManager::constants.cached_tangent[2];
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just realized that these names are backwards; based on the order the components are listed in table III of US6424348B2, it's normal, tangent, binormal (although other parts refer to GX_VA_NBT and normal/binormal/tangent). See also this code and the values for BinormalT and BinormalB.

Since I've consistently had binormal be at index 1 and tangent at index 2, this doesn't result in any emulation difference, but I should probably swap the names for consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've swapped the names.

@Pokechu22 Pokechu22 force-pushed the emboss-single-normal-v2 branch 2 times, most recently from 3da5d51 to fba34a7 Compare April 20, 2022 01:03
Comment on lines -360 to -362
// The following assert was triggered in House of the Dead Overkill and Star Wars Rogue
// Squadron 2
// ASSERT(0); // should have normals
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've tested RS2 via fifoci, but I couldn't find any information about this occurring in House of the Dead Overkill. @JMC47 added this comment in #2727. Any idea where that occurs?

if ((uid_data->components & VB_HAS_NRM0) != 0)
{
if ((uid_data->components & VB_HAS_NRM1) == 0)
out.Write("float3 rawnorm1 = " I_CACHED_TANGENT ".xyz;\n");
Copy link
Contributor

@iwubcode iwubcode Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling these normals is a bit strange. AFAIK a normal is typically a single vector that is perpendicular to the surface.

Maybe instead of rawnorm1 we use rawtangent. Same for rawnorm2 to be rawbinormal.

I imagine we also can change the _norm1 as well. I'm not sure yet what the N0, N1, etc values are so I can't say whether that variable makes sense. But the other values will then be consistent with general graphics operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N0, N1, and N2 are the rows of the normal transformation matrix; P0, P1, and P2 are similarly the rows of the position transformation matrix.

Renaming rawnorm1 and _norm1 could be a bit invasive, but I'm not opposed to doing it. (It'd also make sense to rename _norm0 to just _norm.) I think this naming scheme mostly comes from the vertex loader code referring to a normal count (either 1 or 3), and in turn Nintendo somewhat treating things the same way (though it's either N or NBT).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also this (for D3D):

// inputs
if ((uid_data->components & VB_HAS_NRM0) != 0)
out.Write(" float3 rawnorm0 : NORMAL0,\n");
if ((uid_data->components & VB_HAS_NRM1) != 0)
out.Write(" float3 rawnorm1 : NORMAL1,\n");
if ((uid_data->components & VB_HAS_NRM2) != 0)
out.Write(" float3 rawnorm2 : NORMAL2,\n");

D3D has dedicated semantics for NORMAL/TANGENT/BINORMAL, though it doesn't explain what they do differently. I'll switch to those too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks pokechu, I think that'll make it easier for anyone who is more familiar with graphics logic to understand what is going on. I might end up using these for an enhancement type feature in the near future.

I'm still curious why N0, N1, N2 forms a separate matrix from the normal/binorma/tangent. I'm not a graphics guru but I just find it surprising, I wonder what its purpose was.

Copy link
Contributor

@iwubcode iwubcode Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's to change the space of the vectors. At least it looks like the normal/binormal/tangent are being orthnormalized to the new space? (been a while since I've done that so I might be wrong)

EDIT: guess model -> tangent space?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might need to set #pragma pack_matrix( row_major ) for it to work as expected

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't get matrices working, I think we should add a comment where the matrix multiply is done (with dot products) to make it clear to the reader that a matrix multiply is being implemented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And put the comment in the actual generated shader source... not just the code that generates it.

We kind of want those shaders to be readable too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done that for the first one (with the position matrix), but the same pattern is used throughout for both specialized shaders and ubershaders for normals, the projection matrix, texture coordinate transformation, and dual tex transform. Adding it for each seems a bit messy, and is non-trivial for the texture coordinate ones due to how they're generated; it'd be nice to refactor it but I think it's out of scope for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out-of-scope is fine with me.

@@ -202,7 +233,9 @@ int VertexLoaderX64::ReadVertex(OpArg data, VertexComponentFormat attribute, Com
dest.AddMemOffset(sizeof(float));

// zfreeze
if (native_format == &m_native_vtx_decl.position)
if (native_format == &m_native_vtx_decl.position ||
native_format == &m_native_vtx_decl.normals[1] ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't have to be this review, but could (should?) this be normal, binormal and tangent? Are we gaining anything by them being in an array?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the if (vert_decl.normals[1].enable) would read more clearly below as if (vert_decl.binormal.enable).

(I know I'm nitpicking a bit here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's convenient to be able to iterate over them because they all share the same format, but it's a lot more messy now that special-casing is needed (on master there isn't any situation where indices 1 and 2 were directly referenced like this).

// Only update the binormal/tangent vertex shader constants if the vertex format lacks binormals
// (VertexLoaderManager::binormal_cache gets updated by the vertex loader when binormals are
// present, though)
if (vert_decl.normals[1].enable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ever a case where normal[1] will be set and normal[2] will not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The 3 options are for the vertex format are:

  • None
  • Normals only
  • Normals + Binormals + Tangents.

Copy link
Contributor Author

@Pokechu22 Pokechu22 Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The existing code doesn't make that super clear though (with separate VB_HAS_TANGENT/VB_HAS_BINORMAL (VB_HAS_NRM1 and VB_HAS_NRM2) constants, and the array). There's either 0 normals (vtx_desc.low.Normal == VertexComponentFormat::NotPresent), 1 normal (vtx_desc.low.Normal != VertexComponentFormat::NotPresent and vtx_attr.g0.NormalElements == NormalComponentCount::N), or 3 normals (vtx_desc.low.Normal != VertexComponentFormat::NotPresent and vtx_attr.g0.NormalElements == NormalComponentCount::NBT). (There's also the NormalIndex3 mode, which allows choosing between specifying a single index for the normal, binormal, and tangent, or specifying one index for each of them.) Relevant code: here and here (roughly).

It also isn't possible for normal[1] to be set but normal[0] to be unset (which again the logic doesn't make obvious).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you both for clarifying!

Copy link
Member

@phire phire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this

This is a readability change; there should be no functional or performance differences.
Before, it would always write to index 0 (which is unused).  Now it writes to the correct index.
(Specifically, the copy for VertexLoaderManager::position_cache.  The position matrix index happens elsewhere, and the float path still has special logic to copy to scratch3.)
Pokechu22 and others added 4 commits April 22, 2022 16:54
This more accurately represents what's going on, and also ends at 0 instead of 1, making some indexing operations easier.  This also changes it so that position_matrix_index_cache actually starts from index 0 instead of index 1.
Fixes a large number of effects in Rogue Squadron 2 and 3.
…rmalized

Co-authored-by: Scott Mansell <phiren@gmail.com>
@dolphin-emu-bot
Copy link
Contributor

FifoCI detected that this change impacts graphical rendering. Here are the behavior differences detected by the system:

  • rs2-bumpmapping on ogl-lin-mesa: diff
  • rs2-skybox on ogl-lin-mesa: diff
  • rs3-bumpmapping on ogl-lin-mesa: diff
  • rs2-bumpmapping on sw-lin-mesa: diff
  • rs2-skybox on sw-lin-mesa: diff
  • rs3-bumpmapping on sw-lin-mesa: diff
  • rs2-bumpmapping on ogl-lin-radeon: diff
  • rs2-skybox on ogl-lin-radeon: diff
  • rs3-bumpmapping on ogl-lin-radeon: diff
  • rs2-bumpmapping on uberogl-lin-radeon: diff
  • rs2-skybox on uberogl-lin-radeon: diff
  • rs3-bumpmapping on uberogl-lin-radeon: diff

automated-fifoci-reporter

@JMC47
Copy link
Contributor

JMC47 commented Apr 23, 2022

Multiple approvals and it touches Rogue Squadron 2/3? What have I done to deserve such a bounty.

@JMC47 JMC47 merged commit 56bb965 into dolphin-emu:master Apr 23, 2022
Pokechu22 added a commit to Pokechu22/dolphin that referenced this pull request May 6, 2022
Basically, this disables the offset based on the light in all cases (treating it as 0, and copying the original coordinate unchanged), which should break all emboss map effects.  Thus we can see what games use these effects and what they'd look like without them.  This includes emboss effects that have binormal and tangent vectors specified, which already worked in Dolphin prior to dolphin-emu#10584.
Pokechu22 added a commit to Pokechu22/dolphin that referenced this pull request May 6, 2022
Basically, this disables the offset based on the light in all cases (treating it as 0, and copying the original coordinate unchanged), which should break all emboss map effects.  Thus we can see what games use these effects and what they'd look like without them.  This includes emboss effects that have binormal and tangent vectors specified, which already worked in Dolphin prior to dolphin-emu#10584.
@Pokechu22
Copy link
Contributor Author

As a side note, I ended up finding the expired patent for the emboss effect while looking for (mostly) unrelated stuff: US7307640B2. The relevant sections are "Example Emboss-Style Bump Mapping Texture Coordinate Generation" (PDF column 10) and "Example Emboss Bump-Mapping Texture Coordinate Generation Hardware Implementation" (PDF column 13). It's mostly not new information, but it does note that they handle normalization by computing dot(_tangent, light_pos - vertex_pos) / sqrt(lengthsquared(light_pos - vertex_pos)) where the numerator and denominator are computed in parallel using two separate dot product units (column 14 line 34). This is slightly different from how we do it in terms of performance (mathematically it should be the same, unless things overflow). It also mentions "at least twenty-bit floating point numerical values" (column 14 line 66), and has a chart giving cycle timings (column 15). More importantly, it gives (semi-)consistent terminology to use for everything.

That patent does not mention the trick Factor 5 is doing, and even says "(For the present example embodiment, even where the supplied binormals are constant, for example, with flat surfaces, Command Processor 200 supplies the binormals to Transform Unit 300 on a per-vertex basis.)" (column 12 lines 11-14).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
7 participants