feat(core): GPU instancing auto-batching #2957
…nce data

Introduce automatic GPU instancing for MeshRenderer. The system scans renderer-group uniforms across shader passes, builds a unified std140 UBO layout, and packs per-instance data (ModelMat, Layer, etc.) each frame.

Key changes:
- InstanceDataPacker: packs renderer data into a shared UBO for instanced draws
- ShaderFactory: unified _scanInstanceUniforms, _buildLayout, _injectInstanceUBO
- MeshRenderer._canBatch/_batch: instancing merge logic
- ShaderPass/SubShader: instance-aware compilation with macro cache
- GLSLIfdefResolver: compile-time #ifdef resolution for instance field scanning
- MacroCachePool: pooled ShaderMacroCollection for shader program caching
- RenderQueue: instance-aware draw path with UBO binding
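To make "builds a unified std140 UBO layout" concrete, here is a minimal sketch of std140 offset assignment for one per-instance struct. It assumes the struct holds the model matrix as `mat3x4` (or `mat4`) and the layer as a scalar `int`; the helper name and the type table are illustrative, not the PR's `_buildLayout`.

```typescript
// std140 base alignment and size in bytes for a few GLSL types (illustrative subset).
type Std140Type = "float" | "int" | "vec2" | "vec3" | "vec4" | "mat3x4" | "mat4";

const std140Info: Record<Std140Type, { align: number; size: number }> = {
  float: { align: 4, size: 4 },
  int: { align: 4, size: 4 },
  vec2: { align: 8, size: 8 },
  vec3: { align: 16, size: 12 },
  vec4: { align: 16, size: 16 },
  mat3x4: { align: 16, size: 48 }, // 3 columns, each stored as a vec4
  mat4: { align: 16, size: 64 }, // 4 columns, each stored as a vec4
};

interface LayoutField {
  name: string;
  type: Std140Type;
  offset: number;
}

// Hypothetical helper: lays out struct members in declaration order under std140 rules.
function buildStd140StructLayout(members: ReadonlyArray<[string, Std140Type]>): {
  fields: LayoutField[];
  structSize: number;
} {
  const fields: LayoutField[] = [];
  let offset = 0;
  for (const [name, type] of members) {
    const { align, size } = std140Info[type];
    offset = Math.ceil(offset / align) * align; // pad up to the member's base alignment
    fields.push({ name, type, offset });
    offset += size;
  }
  // As an array element, a struct's stride is rounded up to a multiple of 16 bytes.
  return { fields, structSize: Math.ceil(offset / 16) * 16 };
}

const affine = buildStd140StructLayout([
  ["renderer_ModelMat", "mat3x4"],
  ["renderer_Layer", "int"],
]);
const full = buildStd140StructLayout([
  ["renderer_ModelMat", "mat4"],
  ["renderer_Layer", "int"],
]);
console.log(affine.structSize); // 64
console.log(full.structSize); // 80
```

The two calls reproduce the 80 → 64 byte struct size quoted for the affine ModelMat change: the trailing 4-byte `int` still costs a full 16-byte slot because of the struct stride rounding.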
Walkthrough
Adds WebGL2 GPU instancing: instance UBO management, instance-aware shader compilation and caching, batching API/signature changes, render-queue instanced draw paths, device/GL helpers, examples and E2E tests, and supporting shader/uniform enhancements.
Sequence Diagram(s)
sequenceDiagram
participant RenderQueue as RenderQueue
participant BatcherManager as BatcherManager
participant InstanceBatch as InstanceBatch
participant GPUBuffer as GPU Constant Buffer
participant ShaderProgram as ShaderProgram
RenderQueue->>BatcherManager: request instanceBatch (lazy)
BatcherManager->>InstanceBatch: setLayout(layout)
InstanceBatch->>GPUBuffer: create/realloc UBO (if needed)
loop per instanced chunk
RenderQueue->>InstanceBatch: upload(renderers[], start, count)
InstanceBatch->>InstanceBatch: pack per-instance fields into CPU buffer
InstanceBatch->>GPUBuffer: setData(range, Discard)
end
RenderQueue->>ShaderProgram: bindUniformBlocks(bindingMap)
ShaderProgram->>GPUBuffer: uniformBlockBinding(bindingPoint)
RenderQueue->>RenderQueue: issue drawPrimitive with instanceCount
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Move _gpuInstanceMacro after _macroMap declaration to fix static initialization order. Also apply prettier formatting fixes.
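Class static initializers run top to bottom, so an initializer that reaches a static field declared below it sees `undefined`. A minimal sketch of the failure and the fix, assuming a registry that hands out macro ids on first lookup; `MacroRegistry`, `getByName`, and the id scheme are hypothetical, and only the field names `_macroMap` and `_gpuInstanceMacro` come from the commit message.

```typescript
// Correct order: the map is initialized before any initializer that uses it.
class MacroRegistry {
  static readonly _macroMap: Map<string, number> = new Map();
  static readonly _gpuInstanceMacro: number = MacroRegistry.getByName("RENDERER_GPU_INSTANCE");

  // Static methods exist before any static field initializer runs.
  static getByName(name: string): number {
    let id = MacroRegistry._macroMap.get(name);
    if (id === undefined) {
      id = MacroRegistry._macroMap.size;
      MacroRegistry._macroMap.set(name, id);
    }
    return id;
  }
}

// Wrong order: getByName runs while _macroMap is still undefined, so evaluating
// the class declaration itself throws a TypeError.
function defineMisorderedRegistry(): unknown {
  class Misordered {
    static readonly _gpuInstanceMacro: number = Misordered.getByName("RENDERER_GPU_INSTANCE");
    static readonly _macroMap: Map<string, number> = new Map();
    static getByName(name: string): number {
      return Misordered._macroMap.get(name) ?? -1;
    }
  }
  return Misordered;
}

let misorderedError: unknown = null;
try {
  defineMisorderedRegistry();
} catch (e) {
  misorderedError = e;
}
console.log(MacroRegistry._gpuInstanceMacro); // 0
console.log(misorderedError instanceof TypeError); // true
```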
Codecov Report
❌ Patch coverage is
Additional details and impacted files

| Coverage Diff | dev/2.0 | #2957 | +/- |
|---|---|---|---|
| Coverage | 77.38% | 77.08% | -0.30% |
| Files | 900 | 907 | +7 |
| Lines | 98752 | 99807 | +1055 |
| Branches | 9817 | 9866 | +49 |
| Hits | 76415 | 76933 | +518 |
| Misses | 22170 | 22703 | +533 |
| Partials | 167 | 171 | +4 |
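The coverage percentages follow directly from the hit/miss/partial counts. A quick check, assuming Codecov's definition of coverage as hits / (hits + misses + partials):

```typescript
interface CoverageCounts {
  hits: number;
  misses: number;
  partials: number;
}

// Line coverage as a percentage: hits over all tracked lines.
function coveragePercent({ hits, misses, partials }: CoverageCounts): number {
  return (100 * hits) / (hits + misses + partials);
}

const base: CoverageCounts = { hits: 76415, misses: 22170, partials: 167 }; // dev/2.0
const head: CoverageCounts = { hits: 76933, misses: 22703, partials: 171 }; // #2957

console.log(coveragePercent(base).toFixed(2)); // 77.38
console.log(coveragePercent(head).toFixed(2)); // 77.08
console.log((coveragePercent(head) - coveragePercent(base)).toFixed(2)); // -0.30
```

The drop comes from the 1055 new lines being covered at roughly 49% (518 of 1055), which is below the existing baseline.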
…es layout

The instance UBO is injected at compile time by ShaderFactory, so the original shader uniform declarations don't need modification.
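A sketch of why shader sources can stay untouched: at compile time the matched renderer uniform declarations are replaced by an injected block that stores the fields in an array of structs, and one `#define` per field maps each old uniform name onto the current instance's entry. The struct, block, and array names below are hypothetical, and `gl_InstanceID` assumes vertex-stage access (a fragment shader would need the instance index passed through a flat varying).

```typescript
interface InstanceField {
  name: string; // original uniform name, e.g. "renderer_Layer"
  glslType: string; // member type inside the UBO struct
  unpack?: (access: string) => string; // optional conversion back to the original type
}

// A distinct member name keeps each #define from naming itself in its own expansion.
const memberName = (uniformName: string): string => `_${uniformName}`;

// Hypothetical generator: struct array inside a std140 block, plus one #define per field
// so code written against the old uniform names compiles unchanged.
function buildInstanceBlockGLSL(fields: readonly InstanceField[], maxInstances: number): string {
  const members = fields.map((f) => `  ${f.glslType} ${memberName(f.name)};`);
  const defines = fields.map((f) => {
    const access = `renderer_InstanceData[gl_InstanceID].${memberName(f.name)}`;
    return `#define ${f.name} ${f.unpack ? f.unpack(access) : access}`;
  });
  return [
    "struct RendererInstanceData {", // struct declared outside the uniform block
    ...members,
    "};",
    "layout(std140) uniform RendererInstanceBlock {",
    `  RendererInstanceData renderer_InstanceData[${maxInstances}];`,
    "};",
    ...defines,
  ].join("\n");
}

const glsl = buildInstanceBlockGLSL(
  [
    // Affine rows stored as the columns of a mat3x4; transpose + mat4() restores a full mat4.
    { name: "renderer_ModelMat", glslType: "mat3x4", unpack: (a) => `mat4(transpose(${a}))` },
    { name: "renderer_Layer", glslType: "int" },
  ],
  256,
);
console.log(glsl);
```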
Bring back normalMat extraction, transform_declare reordering, trailing whitespace fixes, and VertexPBR indent fix. Only the renderer_Layer relocation stays reverted.
… 3×vec4

- Fix SubRenderElement.set() not resetting instanceDataPacker, causing stale packer references from previous frames to break all batching
- Use whitelist + _group fallback for identifying renderer uniforms in _scanInstanceUniforms (fixes _group===undefined for ModelMat)
- Store ModelMat as 3×vec4 (affine rows) instead of mat4 in UBO, saving 16 bytes per instance (structSize 80→64, +25% instances/batch)
- Add camera_VPMat to transform_declare.glsl for derived MVP define
- Extract struct definition outside uniform block (GLSL ES 3.00 compat)
- Fix _insertUBOBlock to only scan initial #define preamble
- Pass instanceID to fragment shader via flat varying
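Storing ModelMat as 3×vec4 works because the bottom row of an affine transform is always (0, 0, 0, 1). A minimal packing sketch, assuming a column-major 16-element source matrix as used by WebGL; `packAffineRows` is a hypothetical helper, not the engine's packer.

```typescript
// Writes rows 0..2 of a column-major 4x4 affine matrix (element at row r, column c
// lives at m[c * 4 + r]) as three consecutive vec4s starting at dst[offset].
// Row 3 of an affine transform is always (0, 0, 0, 1), so it is not stored.
function packAffineRows(m: ArrayLike<number>, dst: Float32Array, offset: number): void {
  for (let r = 0; r < 3; r++) {
    for (let c = 0; c < 4; c++) {
      dst[offset + r * 4 + c] = m[c * 4 + r];
    }
  }
}

// Uniform scale by 2 and translation by (5, 6, 7), column-major.
const model = [2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 5, 6, 7, 1];
const rows = new Float32Array(12); // 3 × vec4 = 48 bytes instead of 64 for a full mat4
packAffineRows(model, rows, 0);
console.log(Array.from(rows).join(",")); // 2,0,0,5,0,2,0,6,0,0,2,7
```

On the GLSL side, if the three rows are read back as the columns of a `mat3x4`, one way to rebuild the matrix is `mat4(transpose(m))`: `transpose` yields a `mat4x3` whose columns are the original upper-3×4 columns, and the `mat4` constructor fills the missing fourth row from the identity.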
…lity

Wrap derived NormalMat define with mat4() so instancing and non-instancing paths both produce mat4, avoiding shader compilation errors. Add custom instance data example to verify per-renderer uniform batching.
Rename elementA/elementB to preSubElement/subElement across all renderer subclasses and BatchUtils. Change _batch signature so preSubElement is nullable (null = batch head, no previous element to merge with), and subElement is always required.
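A sketch of how a nullable `preSubElement` lets one function handle both the batch head and the merge case. `BatchElement`, its `key` field (a stand-in for whatever `_canBatch` actually compares), and `batchAll` are hypothetical; the batch step only collects renderers into a reused list.

```typescript
interface BatchElement {
  key: string; // stand-in for what _canBatch compares (primitive, material, macros, ...)
  renderer: string;
  instancedRenderers: string[]; // pre-allocated list, reused across frames
}

function canBatch(preSubElement: BatchElement, subElement: BatchElement): boolean {
  return preSubElement.key === subElement.key;
}

// preSubElement === null marks subElement as a batch head; otherwise the head
// (preSubElement) absorbs subElement's renderer and subElement is not drawn on its own.
function batch(preSubElement: BatchElement | null, subElement: BatchElement): void {
  if (preSubElement === null) {
    subElement.instancedRenderers.length = 0;
    subElement.instancedRenderers.push(subElement.renderer);
  } else {
    preSubElement.instancedRenderers.push(subElement.renderer);
  }
}

// Walks already-sorted elements and returns the batch heads that will be drawn.
function batchAll(elements: readonly BatchElement[]): BatchElement[] {
  const heads: BatchElement[] = [];
  let head: BatchElement | null = null;
  for (const element of elements) {
    if (head !== null && canBatch(head, element)) {
      batch(head, element);
    } else {
      batch(null, element);
      heads.push(element);
      head = element;
    }
  }
  return heads;
}

const make = (key: string, renderer: string): BatchElement => ({ key, renderer, instancedRenderers: [] });
const heads = batchAll([make("a", "r0"), make("a", "r1"), make("b", "r2"), make("a", "r3")]);
console.log(heads.map((h) => h.instancedRenderers.join("+")).join(" | ")); // r0+r1 | r2 | r3
```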
- Move RENDERER_GPU_INSTANCE macro from ShaderMacro to InstanceDataPacker
- Rename getOrCreate() to get() in InstanceDataPackerPool
- Clear compileMacros in InstanceDataPacker.reset()
Move macro merging, layout computation, and UBO packing from batch phase to render phase. _batch now only collects renderers into a pre-allocated list. RenderQueue.render handles macro union, layout lookup, and splits by maxInstanceCount for sub-batch rendering.

- SubRenderElement: instanceDataPacker → instancedRenderers (pooled array)
- MeshRenderer._canBatch: remove maxInstanceCount check
- MeshRenderer._batch: only push renderers, zero allocation
- InstanceDataPacker: remove compileMacros/addRenderer/instanceCount, add packAndUpload(renderers, start, count)
- InstanceDataPackerPool: remove uploadBuffer, simplify reset
- BatcherManager: remove instancing uploadBuffer call
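The split by maxInstanceCount can be sketched as follows. It assumes the per-draw limit is the number of per-instance structs that fit in one uniform block (WebGL2 guarantees a `MAX_UNIFORM_BLOCK_SIZE` of at least 16384 bytes); both helper names are hypothetical, and the upload/draw step appears only as a comment.

```typescript
// Hypothetical: how many per-instance structs fit in one uniform block.
function computeMaxInstanceCount(maxUniformBlockSize: number, structSize: number): number {
  return Math.floor(maxUniformBlockSize / structSize);
}

// Splits `total` batched renderers into [start, start + count) ranges of at most `max`.
function splitIntoSubBatches(total: number, max: number): Array<{ start: number; count: number }> {
  if (max <= 0) throw new RangeError("max must be positive");
  const ranges: Array<{ start: number; count: number }> = [];
  for (let start = 0; start < total; start += max) {
    ranges.push({ start, count: Math.min(max, total - start) });
    // Per range, a render loop would pack/upload renderers[start .. start + count)
    // and issue one draw with instanceCount = count.
  }
  return ranges;
}

const max = computeMaxInstanceCount(16384, 64); // 256 with a 64-byte struct (204 with 80 bytes)
console.log(splitIntoSubBatches(600, max).map((r) => `${r.start}+${r.count}`).join(" ")); // 0+256 256+256 512+88
```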
Packer is now stateless (setLayout + packAndUpload + draw), so only one instance is needed. Discard upload ensures no GPU stall when reusing the same buffer across shadow and main passes.

- Delete InstanceDataPackerPool.ts
- BatcherManager: instanceDataPackerPool → instanceDataPacker
- Remove resetInstanceDataPackerPool lifecycle
- Saves GPU memory by using single buffer instead of pool
Rename class, file, and all references to better reflect its role as an instance batch manager rather than a generic data packer.
It's a macro-keyed map, not a pool (no borrow/return semantics).
Align variable and method names with MacroMap rename — these are maps, not pools.
Callers always pass a valid buffer, no need for null guard.
- nativeBuffer → buffer, _uboData → _data
- Replace separate instanceFields/_structSize with single _layout ref
- setLayout() now takes InstanceLayout directly

- Remove unnecessary null guard on _layout
- Inline uploadElements variable
- Destructure floatView/intView from this
- Improve worldMatrix comment

- Remove unnecessary component→renderer alias, use component directly
- Hoist bindUniformBlocks/bindUniformBufferBase out of sub-batch loop
- Move primitive.instanceCount=0 after loop (only need to reset once)
- Remove redundant let layout = undefined
- Upgrade lint-staged from v10.5 to v16.4.0
- Fix glob from *.{ts} to **/*.ts to match subdirectory files
- Remove redundant git add from tasks
- eslint 8.44 → 8.57
- @typescript-eslint/parser and eslint-plugin 6.x → 8.x
- Eliminates "unsupported TypeScript version" warning
Dispose means the object is permanently released, so null the array to free memory instead of just clearing length.
Eliminate redundant SubShader._getInstanceLayout / ShaderPass._scanInstanceFields by reusing the shader compilation chain — _injectInstanceUBO now returns InstanceLayout directly, stored on ShaderProgram._instanceLayout.
Remove the RenderElement container layer and promote SubRenderElement (renamed to RenderElement) as the direct sort/render unit. Previously, sorting was done at the container level while batching operated on individual sub-elements, causing multi-submesh objects to break batches. Now sorting and batching are aligned at the same granularity, allowing same-material elements from different objects to be properly batched together, reducing draw calls.
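The effect of aligning sort and batch granularity shows up in a toy draw-call count. This sketch assumes a sort key of (material, primitive) and counts one draw per run of adjacent equal keys; `DrawItem` and both helpers are hypothetical, and a real render queue would also respect other ordering constraints.

```typescript
interface DrawItem {
  object: string;
  materialId: number;
  primitiveId: number;
}

// One draw per run of adjacent items sharing (material, primitive).
function countDraws(items: readonly DrawItem[]): number {
  let draws = 0;
  items.forEach((item, i) => {
    const prev = i > 0 ? items[i - 1] : undefined;
    if (!prev || prev.materialId !== item.materialId || prev.primitiveId !== item.primitiveId) draws++;
  });
  return draws;
}

// Sorting individual elements (not per-object containers) makes equal keys
// from different objects adjacent.
function sortForBatching(items: readonly DrawItem[]): DrawItem[] {
  return [...items].sort((a, b) => a.materialId - b.materialId || a.primitiveId - b.primitiveId);
}

// Two cars, each with a body submesh and a wheel submesh, in per-object order.
const perObject: DrawItem[] = [
  { object: "car1", materialId: 1, primitiveId: 10 },
  { object: "car1", materialId: 2, primitiveId: 20 },
  { object: "car2", materialId: 1, primitiveId: 10 },
  { object: "car2", materialId: 2, primitiveId: 20 },
];
console.log(countDraws(perObject)); // 4
console.log(countDraws(sortForBatching(perObject))); // 2
```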
The batched field on RenderElement was written by BatcherManager but never effectively consumed - 2D overrides hardcoded true, 3D path skipped via isInstanced. Remove the field, _batchedTransformShaderData cache, and consolidate transform update methods into two clear paths: _updateTransformShaderData (3D) and _updateWorldSpaceTransformShaderData (2D).
…o InstanceBufferLayout
…Queue

- BatchUtils -> VertexMergeBatcher (file and class); batchFor2D -> batch
- BatcherManager.instanceBatch -> instanceBuffer (align with class name)
- canBatchSprite: reorder conditions for better short-circuit
- RenderQueue: hoist needMaskType out of loop, add phase comments

# Conflicts:
#	packages/shader/src/shaders/Transform.glsl
Addendum [P0] RenderElement.ts:dispose:
dispose(): void {
- this.instancedRenderers = null;
+ this.instancedRenderers.length = 0;
}
(From the in-depth review addendum; missed in the previous review.)
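Cherry-picked 69 commits from PR #2957 (feat/gpu-instancing) onto the fix/shaderlab branch, squashed into a single commit.

## Core changes

### New files
- `InstanceBuffer.ts`: UBO-based instance data management; packs the worldMatrix/shaderData of multiple renderers into one UBO
- `VertexMergeBatcher.ts`: replaces the old BatchUtils; unified 2D/3D batching entry point
- `ShaderBlockProperty.ts`: UBO block property descriptor
- `ShaderProgramMap.ts`: replaces the old ShaderProgramPool; supports instancing layout caching
- `ConstantBufferBindingPoint.ts`: UBO binding point enum

### Modified files
- `MeshRenderer._canBatch/_batch`: automatic batching decision for identical mesh + material + macros
- `SkinnedMeshRenderer`: marked as not eligible for GPU instancing
- `RenderQueue`: sorts by material/primitive (instead of by distance); at render time, detects instanced batches and draws each one in a single call through InstanceBuffer
- `RenderElement`: flattened (SubRenderElement removed); supports an instancedRenderers list
- `ShaderFactory`: UBO layout computation and instancing GLSL injection (RENDERER_GPU_INSTANCE macro)
- `ShaderPass`: detects the GPU instance macro at compile time and injects the UBO declaration
- `ShaderProgram`: stores instanceLayout
- `Renderer`: removes the batched-related fields

### Deleted files
- `BatchUtils.ts` → replaced by `VertexMergeBatcher.ts`
- `SubRenderElement.ts` → merged into `RenderElement.ts`
- `ShaderProgramPool.ts` → replaced by `ShaderProgramMap.ts`

## Cherry-pick conflict resolution log

The differences between the fix/shaderlab branch and the PR baseline (dev/2.0) are mainly in the following files:

1. **ShaderPass.ts**: fix/shaderlab uses `Shader._shaderLab._parseMacros` to process ShaderLab macros, while the PR uses `ShaderMacroProcessor.evaluate` (which does not exist on fix/shaderlab). Resolution: keep fix/shaderlab's macro handling and add the PR's instancing UBO injection logic.
2. **Transform.glsl**: fix/shaderlab already declares `camera_VPMat`, and the PR adds it as well. Resolution: merge the declarations from both sides.
3. **UIRenderer.ts**: the PR renames `BatchUtils` to `VertexMergeBatcher` and `batchFor2D` to `batch`, but the UI package on fix/shaderlab was not updated accordingly. Resolution: manually update the UI package's imports and calls.
4. **GLSLIfdefResolver.ts**: added by an early PR commit and deleted by a later one. After the cherry-pick, ShaderPass.ts still contained a stale import. Resolution: remove the unused import.

## Verification

In the CarParking game, draw calls dropped from 905 to ~80 (seats, wheels, and other objects sharing the same mesh + material are batched automatically).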
Bug:
_scanInstanceUniforms regex-matches uniform declarations without understanding #ifdef blocks. For raw GLSL paths the source still contains preprocessor directives at scan time, so uniforms inside inactive branches (e.g. renderer_JointMatrix under #ifdef RENDERER_HAS_SKIN) get matched even when they won't compile. This caused "GPU Instancing does not support array uniform" errors for plain MeshRenderer batching whenever a SkinnedMeshRenderer had previously registered renderer_JointMatrix under ShaderDataGroup.Renderer. Add _scanInstanceUniformsWithMacros that walks the source line-by-line with a branch stack for #ifdef/#ifndef/#else/#endif, delegating active lines to the original scanner. compilePlatformSource passes its active macro set; the ShaderLab path keeps using the plain scanner since ShaderMacroProcessor.evaluate already expands directives there. Also change the array-uniform fallback from deletion to keeping the declaration as a regular uniform, so stray matches never directly fail shader compilation.
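A minimal sketch of the branch-stack idea, assuming only `#ifdef`/`#ifndef`/`#else`/`#endif` are evaluated and `#if`/`#elif` branches are conservatively treated as active. `scanActiveUniformNames`, `BranchFrame`, and both regexes are illustrative, not the engine's `_scanInstanceUniformsWithMacros`. Each frame records whether its enclosing region is active, so an `#else` never needs to look further up the stack.

```typescript
// One frame per open #if-family directive.
interface BranchFrame {
  parentActive: boolean; // is the enclosing region active?
  active: boolean; // is the current branch of this directive active?
  opaque: boolean; // #if/#elif expressions are not evaluated: keep every branch active
}

const directiveRegex = /^#\s*(ifdef|ifndef|if|elif|else|endif)\b\s*(\w*)/;
const uniformRegex = /^uniform\s+(?:(?:lowp|mediump|highp)\s+)?\w+\s+(\w+)/;

// Returns names of uniforms declared on lines that survive preprocessor resolution.
function scanActiveUniformNames(source: string, activeMacros: ReadonlySet<string>): string[] {
  const stack: BranchFrame[] = [];
  const names: string[] = [];
  const isActive = (): boolean => stack.length === 0 || stack[stack.length - 1].active;

  for (const rawLine of source.split("\n")) {
    const line = rawLine.trim();
    const directive = directiveRegex.exec(line);
    if (directive) {
      const kind = directive[1];
      const arg = directive[2];
      const top: BranchFrame | undefined = stack[stack.length - 1];
      switch (kind) {
        case "ifdef":
        case "ifndef": {
          const parentActive = isActive();
          const defined = activeMacros.has(arg);
          const taken = kind === "ifdef" ? defined : !defined;
          stack.push({ parentActive, active: parentActive && taken, opaque: false });
          break;
        }
        case "if": {
          const parentActive = isActive();
          stack.push({ parentActive, active: parentActive, opaque: true });
          break;
        }
        case "elif":
          if (top) {
            top.opaque = true;
            top.active = top.parentActive;
          }
          break;
        case "else":
          if (top) top.active = top.opaque ? top.parentActive : top.parentActive && !top.active;
          break;
        case "endif":
          stack.pop();
          break;
      }
      continue;
    }
    if (!isActive()) continue;
    const uniform = uniformRegex.exec(line);
    if (uniform) names.push(uniform[1]);
  }
  return names;
}

// Shape of the skinning block discussed in the review of this commit.
const commonVert = [
  "#ifdef RENDERER_HAS_SKIN",
  "  #ifdef RENDERER_USE_JOINT_TEXTURE",
  "    uniform sampler2D renderer_JointSampler;",
  "  #else",
  "    uniform mat4 renderer_JointMatrix[ RENDERER_JOINTS_NUM ];",
  "  #endif",
  "#endif",
  "uniform mat4 renderer_ModelMat;",
].join("\n");

console.log(scanActiveUniformNames(commonVert, new Set<string>()).join(", ")); // renderer_ModelMat
console.log(scanActiveUniformNames(commonVert, new Set(["RENDERER_HAS_SKIN"])).join(", ")); // renderer_JointMatrix, renderer_ModelMat
```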
GuoLei1990 left a comment
Incremental review (2026-04-23)
Based on the latest commit 29567302c (fix(shader): scan instance uniforms with macro awareness for raw GLSL) and the complete current state of the code.
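Closed issues

| # | Issue | Status |
|---|---|---|
| 1 | _canBatch missing renderer-level state | ✅ Fixed (macroCollection.isEqual + frontFace) |
| 2 | renderer_LocalMat/renderer_MVInvMat not removed | ✅ Fixed |
| 3 | NormalMat per-vertex inverse() | ✅ Closed (trades GPU ALU for UBO space; to be optimized later) |
| 4 | Removal of opaque distance sorting | ✅ Closed (batch-first is the right direction for TBDR architectures) |
| 5 | WebGL2 guard | ✅ Closed (guaranteed upstream by _canBatch) |
| 6 | SpriteMask interaction | ✅ Closed (unrelated to MeshRenderer instancing) |
| 7 | mat3x4 affine assumption | ✅ Closed (guaranteed by the architecture) |
| 8 | ShaderProgramMap constructor | ✅ Fixed |
| 9 | Array uniform detection | ✅ Fixed (reports an error and keeps the original declaration) |
| 10 | _scanInstanceUniforms is not #ifdef-aware | ✅ Fixed (this commit) |
| 11 | bindUniformBlock dead code | ✅ Cleaned up |
| 12 | Regex matches uniform declarations inside comments | ✅ Closed (to be replaced by ShaderLab metadata later) |
| 13 | Breaking change from removing renderer_LocalMat | ✅ Closed (dev/2.0 scope) |
| 14 | _std140TypeInfoMap missing bool | ✅ Closed (extremely rare; when hit, the type is silently skipped and compilation does not fail) |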
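Analysis of this commit
_scanInstanceUniformsWithMacros: macro-aware preprocessing scanner
Fixes the bug reported by @zhuxudong: when SkinnedMeshRenderer and MeshRenderer coexist, renderer_JointMatrix sits inside an #ifdef RENDERER_HAS_SKIN block but was still matched by the regex, causing an error.
Implementation review:
Line-by-line parsing plus a branchStack state machine, supporting nested #ifdef / #ifndef / #else / #endif. The approach is concise and effective. Key logic verified: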
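- Nested #ifdef + #else: verified against the actual structure of common_vert.glsl:

      #ifdef RENDERER_HAS_SKIN                  // MeshRenderer: SKIN undefined → push false
        #ifdef RENDERER_USE_JOINT_TEXTURE       // parent inactive → push false
          uniform sampler2D ...                 // top=false → skip ✓
        #else                                   // parentActive=false, !current=true → false ✓
          uniform mat4 renderer_JointMatrix...  // top=false → skip ✓
        #endif
      #endif

  A nested #else correctly propagates unreachability when its parent is inactive. ✓
- Compatibility of per-line calls to _scanInstanceUniforms: _uboUniformRegex has the gm flags, and String.prototype.replace() resets lastIndex on every call, so calling it on a single line is safe. ✓
- #if expressions are not supported: a comment notes that #if with expressions is treated as always-active. This is conservatively correct: over-scanning only wastes UBO space and does not affect rendering correctness. ✓
- Array uniforms changed from return "" to return match: unsupported array uniforms keep their original declaration instead of being deleted. This is the correct fix, since deleting them previously caused compilation failures. ✓
- activeMacros construction: compilePlatformSource builds a Set<string> from shaderMacroList that includes renderer macros, material macros, and engine macros, fully covering the compile-time macro set. ✓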
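The `lastIndex` point in the second bullet can be checked in isolation. The regex below is a simplified stand-in for `_uboUniformRegex`, not the real pattern:

```typescript
// Simplified stand-in with the same global + multiline flags.
const uniformDecl = /^uniform\s+\w+\s+(\w+);/gm;
const declLine = "uniform mat4 renderer_ModelMat;";

// test()/exec() on a /g regex are stateful: the second call resumes at lastIndex and fails.
const first = uniformDecl.test(declLine);
const second = uniformDecl.test(declLine);

// replace() with a /g regex resets lastIndex to 0 before matching, so leftover
// state from earlier calls cannot make it skip part of the next line.
uniformDecl.lastIndex = 5;
const replaced = declLine.replace(uniformDecl, "/* moved to UBO: $1 */");

console.log(first, second); // true false
console.log(replaced); // /* moved to UBO: renderer_ModelMat */
```

This is why per-line `replace()` calls are safe while a per-line loop built on `test()` or `exec()` with the same regex object would not be.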
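Design choice: dual path (activeMacros ? WithMacros : plain)
injectInstanceUBO keeps the optional activeMacros?: Set<string> parameter; without macros it takes the original regex path. This path is kept for ShaderLab: ShaderLab already expands #ifdef before compilation, so it does not need macro-aware scanning. Keeping the two paths separate is correct and avoids needless line-by-line parsing of source that has already been expanded. ✓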
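Issues
- [P2] RenderElement.dispose() sets instancedRenderers to null, which contradicts the non-null assumption in set():

      // dispose() line 48
      this.instancedRenderers = null;
      // set() line 38
      this.instancedRenderers.length = 0; // NPE if instancedRenderers is null

  Currently, ClearableObjectPool.garbageCollection() calls dispose() and then runs elements.length = 0 (clearing the array), so disposed elements are never returned by get(). Under the existing pool implementation this is safe. It is, however, a contract inconsistency: dispose() breaks the precondition of set(). If the pool implementation changes in the future (e.g. a shrink-to-fit GC that disposes only the surplus elements at the tail), it will immediately throw an NPE. The fix costs nothing:

      this.instancedRenderers.length = 0; // instead of = null

  This keeps the array instance alive while releasing its internal references, and keeps the semantics consistent with other array fields in dispose() (including any added in the future).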
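A minimal sketch of the suggested contract, using a hypothetical `PooledRenderElement` that keeps only the two relevant fields: `dispose()` clears the array in place, so a later `set()` on a recycled element stays valid.

```typescript
// Hypothetical, simplified stand-in for a pooled RenderElement.
class PooledRenderElement {
  shaderData: object | null = null;
  // Reused across frames; must never become null while the element can be recycled.
  instancedRenderers: object[] = [];

  set(shaderData: object): void {
    this.shaderData = shaderData;
    this.instancedRenderers.length = 0; // relies on the array still existing
  }

  dispose(): void {
    this.shaderData = null; // plain assignment; no && guard needed
    this.instancedRenderers.length = 0; // release references, keep the array
  }
}

const element = new PooledRenderElement();
element.set({});
element.instancedRenderers.push({}, {});
element.dispose();
element.set({}); // safe: dispose() left an empty array behind
console.log(element.instancedRenderers.length); // 0
```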
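Simplification suggestions
- The && short-circuit checks in dispose(), this.shaderData && (this.shaderData = null) and this.texture && (this.texture = null), are unnecessary; just assign null directly. Assigning null to a field that is already null has no side effects or performance cost.
- In _scanInstanceUniformsWithMacros, the true fallback in branchStack.length >= 2 ? branchStack[branchStack.length - 2] : true handles the edge case where #else appears at stack depth 1 (i.e. an #else without a matching #ifdef). This is defensive programming: the GLSL compiler would report an error first, so it cannot actually be triggered. It could be simplified to branchStack[branchStack.length - 2], but keeping it does no harm.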
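Overall assessment
This commit precisely fixes the timing-dependent bug reported by @zhuxudong. The implementation (a line-by-line #ifdef state machine) is concise and correct, and the handling of nested branches has been verified. The dual-path design keeps a zero-overhead original path for ShaderLab. Changing array uniforms to return match is a correct companion fix.
After multiple rounds of iteration, the PR as a whole is of high quality. The only remaining P2 (the dispose() NPE) is a defensive fix; it does not block merging, but fixing it along the way is recommended.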
LGTM 👍
cherry-pick from #2957 commit 2956730

In the compilePlatformSource path, _scanInstanceUniforms ran before #ifdef blocks were expanded, so renderer_JointMatrix inside an inactive branch was matched by the regex and triggered an error.

- compilePlatformSource builds activeMacros and passes it to injectInstanceUBO
- Add _scanInstanceUniformsWithMacros, which tracks #ifdef/#ifndef/#else/#endif line by line
- Array uniforms changed from return "" to return match (kept as regular uniforms)
Closes #194
Summary
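- InstanceBatch packs renderer uniforms (ModelMat, Layer, etc.) into a shared std140 UBO
- ShaderFactory.injectInstanceUBO automatically scans the renderer uniforms in a shader and replaces them with UBO array access plus #define remapping
- mat3x4 storage (affine optimization, 48 bytes vs. 64 bytes); derived uniforms (MVMat/MVPMat/NormalMat) are computed on the fly via #define
- MeshRenderer._canBatch/_batch implement the batching decision (same primitive + material + front face)
- ShaderProgram._recordLocation skips UBO members (location === null), avoiding the creation of useless ShaderUniform objects

Performance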
Test scenario: 2500 glTF models (Avocado) + 2500 custom-shader cubes, all dynamically rotating, scaling, and animating color.
Measured on iPhone (screenshot: 59 FPS / 21 Draw Calls / 5000 objects):
Future Optimization
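injectInstanceUBO obtains renderer uniform information by regex-scanning the GLSL text. If ShaderLab precompilation provided uniform metadata (name, type, group), the regex scan could be replaced by exact code splicing, making the code more robust and extensible.

Key Files
- RenderPipeline/InstanceBatch.ts
- RenderPipeline/RenderQueue.ts
- shaderlib/ShaderFactory.ts
- shader/ShaderPass.ts
- shader/ShaderProgram.ts: _instanceLayout field; skips reflection of UBO members
- mesh/MeshRenderer.ts: _canBatch/_batch batching logic
- shader/ShaderProgramMap.ts

Test plan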