Merge Squid's VBO optimizations, add some more, add canvas support #83

toad-dev · 2022-04-07T06:55:26Z

Merged @SquidDev's VBO backend optimizations. This gave us a huge boost in monitor rendering performance when running in shader mod compatibility mode.
Squid also fixed the TBO backend not respecting fog.
Added a few more optimizations to the VBO backend myself.
Fixed an incompatibility between Canvas and the VBO backend.

At this point we have pretty good shader mod support. All users have to do is leave the monitor renderer option on BEST, and the mod will support both Iris and Canvas. The only downside is that the VBO backend sometimes displays a slight stitching artifact. Fixing this would come at a performance penalty, so I'm going to leave it as is for now.

- Remove the POSITION_COLOR render type. Instead we just render a background terminal quad as the pocket computer light. It's a little (lot?) more cheaty, but saves having to create a render type.
- Use the existing position_color_tex shader instead of our copy. I looked at using RenderType.text, but had a bunch of problems with GUI terminals. It's possible we can fix it, but I didn't want to spend too much time on it.
- Remove some methods from FixedWidthFontRenderer, inlining them into the call site.
- Switch back to using GL_QUADS rather than GL_TRIANGLES. I know Lig will shout at me for this, but the rest of MC uses QUADS, so I don't think best practice really matters here.
- Fix the TBO backend not rendering monitors with fog. Unfortunately we can't easily do this for the VBO one without writing a custom shader (which defeats the whole point of the VBO backend!), as the distance calculation of most render types expects an already-transformed position (camera-relative, I think!) while we pass a world-relative one.
- When rendering to a VBO we push vertices to a ByteBuffer directly, rather than going through MC's VertexConsumer system. This removes the overhead that comes with VertexConsumer, significantly improving performance.
- Pre-convert palette colours to bytes, storing both the coloured and greyscale versions as a byte array. This allows us to remove the multiple casts and conversions (double -> float -> (greyscale) -> byte), offering noticeable performance improvements (multiple ms per frame). We're using a byte[] here rather than a record of three bytes, as notionally it provides better performance when writing to a ByteBuffer directly, compared to calling .put() four times. [^1]
- Memoize getRenderBoundingBox. This was taking about 5% of the total time on the render thread[^2], so it's worth doing. I don't actually think the allocation is the heavy thing here; VisualVM says it's toWorldPos being slow. I'm not sure why, possibly just all the block property lookups? [^2]

Note that none of these changes improve compatibility with Optifine. Right now there are some serious issues where monitors are writing _over_ blocks in front of them. To fix this, we probably need to remove the depth blocker and just render characters with a z offset. I'll do that in a separate commit, as I need to evaluate how well that change will work first.

The main advantage of this commit is the improved performance. In my stress test with 120 monitors updating every tick, I'm getting 10-20fps [^3] (still much worse than TBOs, which manage a solid 60-100). In practice, we'll actually be much better than this. Our network bandwidth limit means only 40 monitors change in a single tick, so FPS is much more reasonable (60+ fps).

[^1]: In general, put(byte[]) is faster than calling put(byte) multiple times. It's just not clear whether this holds when dealing with a small (and loop-unrolled) number of bytes.
[^2]: To be clear, this is with 120 monitors and no other block entities with custom renderers, so not really representative.
[^3]: I wish I could provide a narrower range, but it varies so much between game restarts. It makes it impossible to benchmark anything!
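The palette pre-conversion and direct ByteBuffer writes described in the list above can be sketched roughly as follows. This is not the mod's code: `BytePalette` and `DirectVertexWriter` are hypothetical names, the greyscale formula (a plain average of the three channels) is an assumption, and the 24-byte vertex layout (3 position floats, 4 colour bytes, 2 UV floats) is an assumed position/colour/texture format. The point is only that the double-to-byte conversion happens once per palette entry, and each vertex then copies its colour with one bulk `put(byte[])`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/**
 * Hypothetical sketch: palette colours converted to bytes once, up front, so the
 * per-vertex hot path only copies bytes. Not the actual CC: Tweaked implementation.
 */
final class BytePalette {
    // Each entry is {r, g, b, a}; values 0..255 are stored in Java's signed byte type.
    private final byte[][] colour;
    private final byte[][] greyscale;

    BytePalette(double[][] rgb) {
        colour = new byte[rgb.length][];
        greyscale = new byte[rgb.length][];
        for (int i = 0; i < rgb.length; i++) {
            double r = rgb[i][0], g = rgb[i][1], b = rgb[i][2];
            colour[i] = new byte[] { toByte(r), toByte(g), toByte(b), (byte) 255 };
            // Assumption: greyscale is a plain average of the three channels.
            byte grey = toByte((r + g + b) / 3.0);
            greyscale[i] = new byte[] { grey, grey, grey, (byte) 255 };
        }
    }

    // Clamp to [0, 1], scale to 0..255 and keep the low 8 bits.
    private static byte toByte(double channel) {
        double clamped = Math.max(0.0, Math.min(1.0, channel));
        return (byte) Math.round(clamped * 255.0);
    }

    // No conversion work here: just pick the pre-built array.
    byte[] get(int index, boolean greyscaleMode) {
        return greyscaleMode ? greyscale[index] : colour[index];
    }
}

/** Hypothetical writer for an assumed position (3 floats) + colour (4 bytes) + UV (2 floats) layout. */
final class DirectVertexWriter {
    static final int VERTEX_SIZE = 3 * Float.BYTES + 4 + 2 * Float.BYTES; // 24 bytes

    // Writes straight into the buffer, bypassing any VertexConsumer-style abstraction.
    static void vertex(ByteBuffer buffer, float x, float y, float z, byte[] rgba, float u, float v) {
        buffer.putFloat(x).putFloat(y).putFloat(z);
        buffer.put(rgba); // one bulk copy instead of four put(byte) calls
        buffer.putFloat(u).putFloat(v);
    }
}
```

A real GPU upload would typically use `ByteBuffer.allocateDirect(...).order(ByteOrder.nativeOrder())` so the float encoding matches what the driver expects.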


Somehow this hits a happier path in the JVM. I guess it has trouble inlining the VertexEmitter.vertex() calls because there are multiple implementations, so reducing the number of calls and giving it a chunkier function to JIT down helps? This is all conjecture because I haven't figured out JITWatch yet :) Anyway, this gives about a 9% improvement in my tests.
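The "chunkier function" idea above can be illustrated with a small sketch: instead of four interface calls per quad (one per vertex), the interface exposes one call per quad, and the per-vertex work moves into a private helper that only has one implementation. `QuadEmitter` and `ByteBufferQuadEmitter` are hypothetical names, not the mod's actual VertexEmitter API, and the 24-byte vertex layout is assumed. Whether this helps the JIT in practice is, as the commit message says, conjecture.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch: one interface call per quad instead of four per-vertex calls.
 * Names are illustrative; this is not the mod's VertexEmitter API.
 */
interface QuadEmitter {
    // Emits the four corners of an axis-aligned quad in a single call.
    void quad(float x1, float y1, float x2, float y2, float z, byte[] rgba,
              float u1, float v1, float u2, float v2);
}

final class ByteBufferQuadEmitter implements QuadEmitter {
    private final ByteBuffer buffer;

    ByteBufferQuadEmitter(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public void quad(float x1, float y1, float x2, float y2, float z, byte[] rgba,
                     float u1, float v1, float u2, float v2) {
        // Four corners in order; the required winding depends on the caller's coordinate system.
        put(x1, y1, z, rgba, u1, v1);
        put(x1, y2, z, rgba, u1, v2);
        put(x2, y2, z, rgba, u2, v2);
        put(x2, y1, z, rgba, u2, v1);
    }

    // Private helper with a single implementation: an easy inlining target for the JIT.
    private void put(float x, float y, float z, byte[] rgba, float u, float v) {
        buffer.putFloat(x).putFloat(y).putFloat(z).put(rgba).putFloat(u).putFloat(v);
    }
}
```

With this shape, a call site that sees several `QuadEmitter` implementations still pays the virtual dispatch cost only once per quad rather than once per vertex.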

This gives about a 3% improvement in VBO rebuild stress tests, at the cost of a little more memory. getVertexCount() was showing up as heavy in my profiles. Changing it to a simple upper-bound calculation melts that time away. If there's a maximum-size monitor at 0.5 text scale in the scene, the buffer will grow to ~3 MB. For comparison, the images in the "blit" program were already growing the buffer to ~2.1 MB.
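A minimal sketch of the upper-bound approach described above, assuming (hypothetically) at most one background quad and one character quad per terminal cell, plus a small fixed allowance for extra quads such as borders or the cursor. `TerminalBufferSize`, `EXTRA_QUADS` and the 24-byte vertex size are illustrative assumptions, not the mod's real constants, so this will not reproduce the ~3 MB figure exactly. An exact count would need a pass over every cell before each rebuild; the bound only needs the terminal's dimensions.

```java
/**
 * Hypothetical sketch: size the vertex buffer from the terminal dimensions alone,
 * instead of walking every cell to count the exact number of quads.
 */
final class TerminalBufferSize {
    static final int VERTICES_PER_QUAD = 4;  // GL_QUADS
    static final int VERTEX_SIZE = 24;       // assumed position + colour + UV layout
    static final int EXTRA_QUADS = 8;        // hypothetical allowance for borders and cursor

    // Worst case: every cell needs a background quad and a character quad.
    static long maxQuads(int width, int height) {
        return 2L * width * height + EXTRA_QUADS;
    }

    // Upper bound on bytes; may over-allocate, but costs O(1) to compute.
    static long maxBytes(int width, int height) {
        return maxQuads(width, height) * VERTICES_PER_QUAD * VERTEX_SIZE;
    }
}
```

The trade-off matches the one described in the commit: the buffer is sometimes larger than strictly necessary, in exchange for skipping the counting pass entirely.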

Canvas handles the matrix stack a little differently. We have to multiply in the modelView matrix ourselves. At this point we support both Iris and Canvas to an acceptable level. Users just have to leave the monitor renderer option on BEST and everything will be handled for them.
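"Multiplying in the modelView matrix ourselves" amounts to applying an extra 4x4 transform on the CPU before the vertices reach the pipeline. The sketch below shows that in plain Java, assuming column-major `float[16]` matrices (the OpenGL convention) and an affine model-view matrix. `ModelViewTransform` is a hypothetical helper; the real fix presumably goes through Minecraft's own matrix types, which are not reproduced here.

```java
/**
 * Hypothetical sketch: apply a model-view matrix to vertex positions ourselves,
 * for a pipeline that does not apply it for us. Matrices are column-major float[16];
 * this is not Minecraft's or Canvas's matrix API.
 */
final class ModelViewTransform {
    // Returns M * (x, y, z, 1) as {x', y', z'}, assuming an affine matrix (w stays 1).
    static float[] transformPosition(float[] m, float x, float y, float z) {
        return new float[] {
            m[0] * x + m[4] * y + m[8] * z + m[12],
            m[1] * x + m[5] * y + m[9] * z + m[13],
            m[2] * x + m[6] * y + m[10] * z + m[14],
        };
    }

    // Column-major product a * b: applying the result equals applying b first, then a.
    static float[] multiply(float[] a, float[] b) {
        float[] out = new float[16];
        for (int col = 0; col < 4; col++) {
            for (int row = 0; row < 4; row++) {
                float sum = 0f;
                for (int k = 0; k < 4; k++) {
                    sum += a[k * 4 + row] * b[col * 4 + k];
                }
                out[col * 4 + row] = sum;
            }
        }
        return out;
    }
}
```

Typically the model-view matrix would be combined with the existing pose matrix once per draw via `multiply`, and the combined matrix applied per vertex, rather than doing two matrix-vector products per vertex.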

SquidDev and others added 5 commits April 2, 2022 10:54

Merge remote-tracking branch 'tweaked/mc-1.18.x' into feature/merge-v…

44f61e9

…bo-optimizations

Merith-TK merged commit f10f987 into cc-tweaked:mc-1.18.x/1.18.2 Apr 8, 2022
