
[TE][TIR] Implement layout transformations, non-flat memory buffers #9727

Merged

191 commits merged on Mar 7, 2022

Conversation

Lunderberg
Contributor

This is an implementation of RFC#0039, which adds layout transformations to both flat and non-flat memory buffers.

The commits in this PR are split into 4 sub-sections. Subsection 1 is a refactor that should maintain all existing functionality, while enabling the feature additions in later subsections. Subsections 2-4 are feature additions, each of which starts with a commit adding unit tests for that feature.

  1. Replacing Load/Store nodes with BufferLoad/BufferStore nodes throughout the entire lowering flow. Accesses into flat memory regions, previously represented by Load/Store nodes, are now represented by BufferLoad/BufferStore nodes that access a 1-d memory space.
  2. Implement .transform_layout(), a scheduling step in TE that alters the in-memory layout of a buffer.
  3. Add te.AXIS_SEPARATOR to .transform_layout() to define groups of transformed axes, where each group is flattened to a single axis. These non-flat memory regions are represented by N-d buffer objects, as enabled in step 1.
  4. Expose the transformed axes as the return value of .transform_layout(), similar to the existing .fuse() and .split() functions, to allow additional manipulation by the user.
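To make steps 2 and 3 concrete, here is a plain-Python sketch of the layout-transformation idea (an index map applied before row-major flattening). This does not use TVM's APIs; `transform` and `flatten` are hypothetical stand-ins for what `.transform_layout()` and buffer flattening do.

```python
def transform(indices):
    """Remap logical (i, j) indices into a tiled (i//4, j, i%4) layout."""
    i, j = indices
    return (i // 4, j, i % 4)

def flatten(indices, shape):
    """Row-major flattening of N-d indices into a single 1-d offset."""
    offset = 0
    for idx, dim in zip(indices, shape):
        offset = offset * dim + idx
    return offset

# A logical 16x8 buffer stored in a tiled (4, 8, 4) physical layout:
physical_shape = (4, 8, 4)
print(flatten(transform((5, 3)), physical_shape))  # element (5, 3) -> offset 45
```

Without an axis separator, the transformed indices are flattened all the way to a 1-d offset as above; with separators (step 3), only each group of axes is flattened, leaving an N-d physical buffer.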

The commits are organized internally by git merge --no-ff commits for each sub-section. Unfortunately, those merge commits don't show up very cleanly in GitHub's linear list of commits, but they can be seen with git log --graph --format=format:'%h - %s %d', the output of which is below.

Output of git log --graph --format=format:'%h - %s %d'

*   e017a87c5 - Breakpoint, expose the transformed axes for use in TE scheduling.  (HEAD -> physical_layout)
|\  
| * d89f21479 - [TE] Return transformed iteration variables 
| * 9de453da0 - [TE] Rewrite loop iteration order 
| * 69f59e6ca - [TE] Implement te::Transform 
| * 12f6c2dd0 - [UnitTest] Added tests for loop iteration order. 
|/  
*   9868fd5c7 - Breakpoint, axis separators defined. 
|\  
| * 1bdb548e3 - [TE] Fill BufferNode::axis_separators from StageNode 
| * 749a3c651 - [TE] Added Stage::set_axis_separators. 
| * 3e3c04ebe - [TIR] Added BufferNode::axis_separators 
| * 3317e1d4c - [UnitTest] Test N-d indices exposed to low-level codegen 
|/  
*   0e426b45a - Breakpoint, layout_transform implemented. 
|\  
| * 7ce30a2dc - [TIR] Expose tir.transform.ApplyPhysicalLayout for testing 
| * a78ec6f52 - [TIR] Added ApplyLayoutTransforms as part of StorageFlatten. 
| * db0c190a9 - [TIR] Added PrimFunc attribute "layout_transform_map", filled from TE. 
| * 1d0fe1098 - [TE] Added Stage.transform_layout to the Python TE interface. 
| * 46e173d9f - [TE] Added Stage::transform_layout to the C++ TE implementation. 
| * 5c278837f - [TIR] Added IndexMap class. 
| * bc9c5d242 - [UnitTest] Add unit tests to test physical layout remapping. 
|/  
*   d2b2a52f5 - Breakpoint, removed Store/Load nodes from use. 
|\  
| * 7453a8b22 - Added pre_flattened_shape/pre_flattened_stride fields to Buffer. 
| * 5bd3fd619 - Replace Store/Load with BufferStore/BufferLoad in ir_builder 
| * 31698f0d5 - Updated Buffer::vstore/vload to return BufferLoad/BufferStore objects. 
| * 79c6f1ae0 - Updated tvm::address_of() to hold BufferLoad instead of Load. 
| * 363d8c507 - Replacing Load/Store in codegens. 
| * 4687caa4a - Replacing Store/Load in lowering/legalization passes. 
| * b7daea9c0 - Replacing Store/Load in analysis functions 
| * 8f8eeb40d - Replacing Store/Load in utility passes. 
| * 347865de6 - Replacing Store/Load in StorageFlatten 
| * ffd0737a0 - Removing Store/Load from examples 
| * be1555d91 - Removing Store/Load from optimization passes 
| * 9debd1f55 - Replacing Store/Load in Stmt/Expr Visitor/Mutator 
| * 4916d2c87 - [TIR] Added BufferLoadNode::LegalizeDtype 
|/  
* 5557b8c4e - Improve tvmc error message from lazy-loading frontend imports (#9074)  (upstream/main, main)

@Lunderberg
Contributor Author

masahi pushed a commit that referenced this pull request Mar 30, 2022
…10787)

* [Pass][Bugfix] Disable re-use of non-flat buffers in StorageRewrite.

As a follow-up from #9727, this restricts
StorageRewrite to modifying only flat memory buffers.  When
rewriting, the existing algorithm in StorageRewrite flattens N-d
allocations into 1-d allocations, preventing them from being exposed
to the codegen.

* Bugfix, flattening of Allocate/AllocateConst extents

Previously, these were ignored entirely.  This worked so long as all
allocations were 1-d, as `StorageRewrite` erroneously flattened merged
arrays into 1-d.
Lunderberg added a commit to Lunderberg/tvm that referenced this pull request Apr 8, 2022
`PrimFuncNode::preflattened_buffer_map` was introduced in
apache#9727, in order to maintain a record
of the pre-flattened buffer shape until it can be used in
`MakePackedAPI`.  This commit instead maintains the pre-flattened
shapes in `PrimFuncNode::buffer_map`, while the body of the function
uses a flattened buffer alias.

Passes LLVM tests in test_target_codegen_llvm.py as initial proof of
concept.
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
…pache#9727)

* [TIR] Added BufferLoadNode::LegalizeDtype

When modifying a BufferLoad object, the return dtype must also be
updated.  This exposes the legalization function, so that passes that
use `BufferLoad::CopyOnWrite` to modify the buffer/indices don't need
to repeat the logic to update the dtype returned.

* Replacing Store/Load in Stmt/Expr Visitor/Mutator

* Removing Store/Load from optimization passes

- UpdatePointerStorageScope
- UnrollLoop
- ThreadSync
- LinearAccessPatternFinder
- StoragePlanRewriter
- VectorTypeRewriter
- VectorTypeAccessChecker
- NarrowDataType
- IRConvertSSA
- CompactBufferRegion

* Removing Store/Load from examples

- ConvertAddToSubtract

* Replacing Store/Load in StorageFlatten

Now, outputs BufferLoad/BufferStore with a flattened buffer object.

temp commit, replacing Store/Load, BufferBindUnwrapper

temp commit, replacing Store/Load, StorageFlattener

* Replacing Store/Load in utility passes.

- StmtSimplifier
- IRSubstitute
- BaseInliner
- FeatureVisitor

* Replacing Store/Load in analysis functions

- StorageAccessVisitor
- VarTouchedAnalysis
- MemoryAccessVerifier
- InplaceOpVerifier
- GPUCodeVerifier
- VarTouchVisitor
- LCADetector
- BlockReadWriteDetector
- InstrumentBoundCheckers

* Replacing Store/Load in lowering/legalization passes.

- MakeCrossThreadReduction
- CacheReadRewriter/CacheWriteRewriter
- InjectVirtualThread
- InjectDoubleBuffer
- InjectCopyIntrin
- LowerWarpMemory
- LowerThreadAllreduce
- LowerCustomDatatypes
- LowerTVMBuiltin
- CoProcSync
- MergeDynamicSharedMemAllocations
- VectorizeLoop
- BF16Legalize

* Replacing Load/Store in codegens.

- Device code generators
  - CodegenC
  - CodegenLLVM
  - CodeGenOpenCL

- Utilities used during codegen
  - ArgBinder
  - MakePackedAPI
  - ReturnRewriter
  - SplitHostDevice

- Execution environments
  - CodeGenStackVM
  - CodeGenHybrid
  - AOTExecutorCodegen

* [UnitTest] Add unit tests to test physical layout remapping.

* Updated tvm::address_of() to hold BufferLoad instead of Load.

* [TIR] Added IndexMap class.

Holds a set of variables representing the input indices and
expressions in terms of those input indices.

TODO:

- Add validation, the index mapping should be invertible.
- Add helper function, apply mapping to a set of indices.
- Add helper function, apply mapping to bounds of input indices.
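The structure described above can be sketched in a few lines of plain Python. `MiniIndexMap` and its methods are hypothetical illustrations of the semantics in this commit message, not TVM's actual `IndexMap` API; `map_shape` assumes a monotone per-dimension mapping so the bound of each output axis is reached at the maximum input indices.

```python
class MiniIndexMap:
    def __init__(self, num_inputs, mapping):
        # mapping: callable from input indices to a tuple of output indices
        self.num_inputs = num_inputs
        self.mapping = mapping

    def map_indices(self, indices):
        """Apply the mapping to a concrete set of indices."""
        assert len(indices) == self.num_inputs
        return self.mapping(*indices)

    def map_shape(self, shape):
        """Apply the mapping to the bounds of the input indices
        (valid only for monotone mappings)."""
        max_out = self.map_indices(tuple(dim - 1 for dim in shape))
        return tuple(m + 1 for m in max_out)

index_map = MiniIndexMap(2, lambda i, j: (j, i))  # transpose
print(index_map.map_indices((2, 5)))  # (5, 2)
print(index_map.map_shape((16, 8)))   # (8, 16)
```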

* Updated Buffer::vstore/vload to return BufferLoad/BufferStore objects.

StorageFlatten/FlattenBuffer passes updated to modify the
buffer/indices directly, rather than using vload/vstore.

- Primary purpose of vstore/vload is to allow IR written in python to
  define vectorized load/store.  This usage is maintained by returning
  a BufferLoad/BufferStore node whose index is a Ramp.

- Previously, vstore/vload was also used to compute the 1-d physical
  index of a location within an N-d tensor.  This usage is no longer
  allowed, since flattening buffer accesses at that point would
  prevent layout transformations from being performed after the
  schedule is defined.

* [TE] Added Stage::transform_layout to the C++ TE implementation.

Adds an `Array<IndexMap>` in the stage to define the transformations
to be applied on the tensor's layout.  As of this commit, this mapping
isn't propagated into the TIR graph yet.

* Replace Store/Load with BufferStore/BufferLoad in ir_builder

* [TE] Added Stage.transform_layout to the Python TE interface.

Allows users to specify `s[A].transform_layout(mapping)`, and
propagate into the TE definitions.

* Added pre_flattened_shape/pre_flattened_stride fields to Buffer.

The shape and stride checks performed in ArgBinder::BindDLTensor
(called from MakePackedAPI) require the tensor shape/strides prior to
index flattening.  Therefore, though it is no longer used by the
low-level code generators, we must maintain that information for use
in MakePackedAPI.

* [UnitTest] Test N-d indices exposed to low-level codegen

When using te.AXIS_SEPARATOR in the call to .transform_layout, this
should define groups of axes, each of which is flattened to a single
axis, then exposed to the low-level codegen.
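The grouping behavior described here can be illustrated in plain Python. `flatten_with_separators` is a hypothetical helper, not a TVM API: each group of axes between separators is row-major flattened to a single physical axis, yielding the N-d shape and indices that would be exposed to the low-level codegen.

```python
def flatten_with_separators(shape, indices, separators):
    """Flatten each group of axes (delimited by separator positions)
    to a single axis; return the resulting physical shape and indices,
    one entry per group."""
    out_shape, out_indices = [], []
    group_starts = [0] + list(separators) + [len(shape)]
    for start, end in zip(group_starts[:-1], group_starts[1:]):
        dim, idx = 1, 0
        for d, i in zip(shape[start:end], indices[start:end]):
            dim *= d
            idx = idx * d + i
        out_shape.append(dim)
        out_indices.append(idx)
    return tuple(out_shape), tuple(out_indices)

# A 4-d buffer flattened into a 2-d physical buffer, with a separator
# after the second axis:
print(flatten_with_separators((2, 3, 5, 7), (1, 2, 4, 6), [2]))
# ((6, 35), (5, 34))
```

With an empty separator list this degenerates to the usual fully-flat 1-d buffer.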

* [TIR] Added PrimFunc attribute "layout_transform_map", filled from TE.

Propagated the TE definition of the physical layout into the TIR
graph.

* Added pre_flattened_type.

If a boolean tensor is backed by an int8 buffer, the check on the
argument buffer's type should be against the boolean type.

When rebasing this PR, should be placed after the addition of
pre_flattened_shape/pre_flattened_stride.

* [UnitTest] Added tests for loop iteration order.

After transformation, the iteration order should follow the new
transformed axes.  In addition, the loop iteration variables should be
exposed through the TE interface for further manipulation.

* [TIR] Added BufferNode::axis_separators

- Add axis_separators to represent divisions between groups
  of tensor axes, where each group is flattened into a single
  output axis, to be exposed to the low-level code generators.

- Expose axis_separators to the python interface.

- Update existing C++ calls to the Buffer() constructor.

* [TIR] Added ApplyLayoutTransforms as part of StorageFlatten.

For any buffers that have layout transforms defined in the
"layout_transform_map" attribute of a PrimFunc, rewrite access into
the buffer such that they use the updated ordering.

* Update usage of ir_builder where necessary.

* [TE] Implement te::Transform

Similar to Fuse and Split, this represents a modification to the
existing loop iterations.

* [TE] Added Stage::set_axis_separators.

In C++, this is implemented as an `Array<IntImm>`, specifying
pre-flattening axes after which a new post-flattening axis should be
started.  The python interface uses a sentinel value
`te.AXIS_SEPARATOR` in the call to `transform_layout`, which is then
used to define the array of axis separators.

* [TIR] Expose tir.transform.ApplyLayoutTransforms for testing

* [TE] Rewrite loop iteration order

After .transform_layout, rewrite leaf_iter_vars to follow the updated
order.  Use the te::Transform iter_var relationship to track use of
the transformed variable.

* [TE] Fill BufferNode::axis_separators from StageNode

During ScheduleOps and SchedulePostprocToPrimfunc, the axis separators
defined in the stage must be passed through to the TIR BufferNode.

* [TE] Return transformed iteration variables

* Moved Buffer's pre-flatten information to PrimFunc.

Since the pre-flatten information is only used for validating user
inputs, it makes much more sense to store it alongside the buffer_map.

* Updated ethos-u C++ unit tests to remove use of Load/Store.

* Bugfix, layout transformation.

Error occurred during conversion from TE to IRModule, when layout
transforms were applied to a reader of a `cache_read`.

* In test directory, replacing all instances of T.load.

* Return buffer object from tvm.tir.script.scope_handler.Allocate

Now that the load/store require buffer objects, allocation should also
return a buffer object to be used.

* Added .astype to tvm.script.tir.node.BufferSlice

Since `buf[i]` returns a `BufferSlice`, this lets the TIR examples
that use `buf[i].astype('out_dtype')` continue functioning.

* Replacing all T.store TIR calls.

* Added LOG(FATAL) in constructor of Store/Load nodes.

* Updated tvmscript parser to report error for Store/Load nodes.

* [TVMScript] Added T.preflattened_buffer stmt

Used to specify `PrimFunc::preflattened_buffer_map`. Takes an argument
of the postflattened buffer, so that it will work for both simple
declarations and `T.match_buffer` statements without needing to
introduce a param handle.  All other arguments are identical to
`T.match_buffer.`

* [TVMScript] Updated TVMscript for BufferLoad/BufferStore

- Use `T.preflattened_buffer` calls in TVMScript to represent
  `PrimFunc::preflattened_buffer_map`.

- Remove `T.buffer_decl` for return value of `T.allocate`, now that
  `T.allocate` returns a buffer.

- For buffer access as a different type, make a `T.buffer_decl` for
  those accesses.

* Updated test_tvmscript_roundtrip.py for BufferLoad/BufferStore.

* Updated TIR reference in USMP pool allocation unit tests.

Using let var handles as the data pointer in buffers, rather than just
as `T.load`/`T.store` arguments, requires annotation as
`T.Ptr[T.primtype]`, rather than as `T.handle`.

* fixup! Return buffer object from tvm.tir.script.scope_handler.Allocate

* fixup! Return buffer object from tvm.tir.script.scope_handler.Allocate

* fixup! Replacing all T.store TIR calls.

* fixup! Replacing all T.store TIR calls.

* fixup! Return buffer object from tvm.tir.script.scope_handler.Allocate

* fixup! In test directory, replacing all instances of T.load.

* tir.ComputeInline, correct variable count.

Previously, this metaschedule primitive relied on `tir::UndefinedVars`
ignoring the data pointer of BufferLoad/BufferStore nodes.  When
`tir::UndefinedVars` was updated to visit the data pointer, similar to
the previous behavior when visiting Load/Store nodes, this caused the
count of undefined variables to be unexpectedly high.

* fixup! Replacing all T.store TIR calls.

* fixup! Updated Buffer::vstore/vload to return BufferLoad/BufferStore objects.

* fixup! In test directory, replacing all instances of T.load.

* fixup! In test directory, replacing all instances of T.load.

* fixup! Replacing all T.store TIR calls.

* Expose Buffer index flattening function to Python.

* Updated test_tir_buffer.py offset tests.

Replacing calls to `Buffer.vload` with `Buffer.offset_of`, when
testing the index calculations.

* fixup! Replacing all T.store TIR calls.

* fixup! Replacing all T.store TIR calls.

* fixup! Updated Buffer::vstore/vload to return BufferLoad/BufferStore objects.

* fixup! Replacing Store/Load in lowering/legalization passes.

* fixup! Replacing all T.store TIR calls.

* fixup! Updated ethos-u C++ unit tests to remove use of Load/Store.

* fixup! Replacing Store/Load in lowering/legalization passes.

Fix linting for inject_double_buffer.cc

* fixup! Updated ethos-u C++ unit tests to remove use of Load/Store.

* fixup! Added .astype to tvm.script.tir.node.BufferSlice

* fixup! In test directory, replacing all instances of T.load.

* fixup! Replacing all T.store TIR calls.

* fixup! Replacing all T.store TIR calls.

* fixup! In test directory, replacing all instances of T.load.

* fixup! Replacing all T.store TIR calls.

* fixup! Replacing Store/Load in lowering/legalization passes.

* [UnitTests] Added T.preflattened_buffer in expected result

* fixup! In test directory, replacing all instances of T.load.

* [UnitTests] Bound checker update, compare against N-d buffer bounds.

* Fixup, bound checker vectorize test.

* fixup! Return buffer object from tvm.tir.script.scope_handler.Allocate

* [UnitTest] Fixed breakage in InjectRollingBuffer test.

Needed a bit more re-writing than usual, because the test was
explicitly calling lowering passes, then calling `tvm.build`.  Fixed
by using the standard lowering flow, with preprocessing steps
inserted via `tir.add_lower_pass`.

* fixup! Return buffer object from tvm.tir.script.scope_handler.Allocate

* [UnitTest] Fixed breakage in flatten buffer unit tests.

- Updated pass to allow BufferStore/BufferLoad nodes to be visited
  before the block's alloc buffer.

- Added `T.preflattened_buffer` annotations.

* fixup! Return buffer object from tvm.tir.script.scope_handler.Allocate

* [UnitTests] Fixed breakage in test_tir_buffer.py

- Updated vload test for new behavior.
- Added test for offset_of, testing behavior no longer in vload.
- Added null check for buffer visitor.

* fixup! Replacing Load/Store in codegens.

* [UnitTest] ComputeInline, opaque access test updates

* [UnitTest] Fixup, allow unit test to use `ib.pointer()[0]`.

* fixup! Replacing Load/Store in codegens.

The updated CodegenLLVM should use the BufferStore/BufferLoad
convention of indexing by `sizeof(dtype)`, rather than
`sizeof(dtype.element_of())`.
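The two indexing conventions contrasted in this fixup can be illustrated with a plain-Python byte-offset calculation (not TVM code; the helper names are hypothetical): under the BufferLoad/BufferStore convention one index step advances by the whole vector dtype, while under the older convention it advances by a single scalar element.

```python
def byte_offset_by_vector(index, elem_bytes, lanes):
    # BufferLoad/BufferStore convention: one index step = one whole
    # vector of `lanes` elements, i.e. sizeof(dtype).
    return index * elem_bytes * lanes

def byte_offset_by_element(index, elem_bytes):
    # Older Load/Store convention: one index step = one scalar
    # element, i.e. sizeof(dtype.element_of()).
    return index * elem_bytes

# For a float32x4 vector (4-byte elements, 4 lanes) at index 3:
print(byte_offset_by_vector(3, 4, 4))  # 48
print(byte_offset_by_element(3, 4))    # 12
```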

* fixup! Replacing Store/Load in lowering/legalization passes.

BF16Legalize should also update the preflattened_buffer_map, since it
is overwriting the `BufferNode::data` stored in the buffer_map.

* fixup! Replacing all T.store TIR calls.

* Fixed failing codegen c host unit tests.

- Generated functions used `uint8_t*` parameters for the array handle
  of the return value, rather than the earlier `void*`.

- New parameter type was due to using
  `PointerType(PrimType(DataType::UInt(8)))` as the type annotation, to
  be usable as `BufferNode::data`.

- Changing to `PointerType(PrimType(DataType::Void()))` still allows
  usage as buffer, more appropriately expresses semantics.

- Updated C codegens to allow `void*` types to be generated from
  variables with type annotation, in addition to the previous behavior
  of `DataType::Handle()` variables without type annotation.

* Fixup, StorageFlatten when applied to post-StorageRewrite functions.

Identified in a test that applied `tvm.lower`, then `tvm.build` on the
result.  If the result of an allocate node is used as the backing
buffer for multiple buffers, such as the output of the StorageRewrite
pass, then StorageFlatten would erroneously treat the second
occurrence as a use without an earlier definition.

* fixup, StorageFlatten

When flattening a boolean buffer, the backing buffer should have type
int8, not the preflattened buffer.

* Bugfix, correctly represent void* in LLVM IR.

* Update, replace tir.Load with tir.BufferLoad

* Added TVMScript error check for matching buffer/index dimensionality

Needed for tests/python/unittest/test_tvmscript_error_report.py::test_high_dim_store

* Bugfix, correct return type when lowering custom datatype.

* Bugfix, removed unused primfunc from test_tvmscript_complete.py

* Updated test_meta_schedule_postproc_verify_gpu_code.py TIR

Replaced Load/Store with BufferLoad/BufferStore.

* Allowed ramp nodes with buffer use analysis.

* Updated tests in test_meta_schedule_postproc_verify_gpu_code.py

Needed dummy writes to prevent buffer resizing, in order to trigger
the verification failure due to memory limits.

* Updated TIR examples to be compatible with buffer dimension check.

* Corrected section header in docstring.

* Corrected indices size check in CodeGenC.

* Fixed breakage in LowerThreadAllreduce.

Since the AllocateNode is rewritten, any buffers that refer to those
variables must also be rewritten.

* [UnitTests] Replaced Store/Load in CUDA codegen tests.

* Resolved breakage in C-based codegen for vectorized store/load.

Needed to update to new convention of using the buffer's element type
as the stride.

* Bugfix, incorrect LCA for buffer access in root scope.

This had been present before the BufferLoad/BufferStore changes, but
hadn't triggered on tests using Load/Store nodes.

* Added docstrings for TransformNode member variables.

* Added TODO for future removal of preflattened_buffer_map.

* Fixup, transform layout + cache write tests.

The correct sequence is to first apply any caching as needed, then to
apply layout transformations, and finally to apply thread binds for
the computation step.

* Bugfix, correct element type for scalarized access.

* Bugfix, cuda buffer indexing when declared as different type.

* Cuda codegen, update reference.

* Bugfix, lower allreduce

Loads of the output of the reduction should be replaced for all
buffers sharing a buffer pointer, not just for the buffer object
itself.

* Removed obsolete comment.

* Changed PrimFunc constructor preflattened_buffer_map to Optional

* Removed flatten_buffer argument from T.match_buffer.

* Correct call to VarUseDefAnalysis::VisitBuffer

* Reverted unintentional testing change, lanes=2.

* Updated lower_cross_thread_reduction to use buffer in allreduce

* Updated transform_layout test to disable CSE

* Updated CSE unit tests to use BufferStore

* Replaced Store/Load for vta.transform and unit tests.

* Updated unit tests for lower_cross_thread_reduction.

* Updated arange to use scalar tensors.

The start/stop/step tensors are declared as 0-d scalar tensors, but
were accessed as 1-d tensors.

* Fix breakage in ethosu constant encoding.

Buffers generated by "ethosu_copy" should have their buffer objects
rewritten, but shouldn't have their size updated in ethosu-specific
Call nodes.

* Fix breakage in ethosu call argument checks.

Need to pull out indices from BufferLoad holders, not Load.

* Resolve breakage from mismatched shape/index dimensions

* Split out encoded parameters from preflattened buffer map.

* Updated buffer shape/index dimensions to match in more ethosu tests

* Fixed lint error

* Removed debug code

* Moved arith::Analyzer local variable to class member

* Fixed SSA conversion of allocations.

Can occur if allocation is inside an unrolled loop.  Added unit test
to catch this failure mode.

* Ethos-u index/buffer dimension updates.

* Updated ethosu passes to handle buffer load/store.

* Resolved bug in tvmscript printing of duplicate buffers.

* Fix breakage in ethos-u test_assign_addresses, encode constants

* Apply same changes to T.allocate_const as to T.allocate

Return a buffer when used in TVMScript, allow for aliasing buffers.

* Fix lint errors.

* Further updates for ethos-u tests.

* Updated ethos.u buffer sizes in test.

* Updated tir.BindParams to use BufferLoad instead of Load.

* Updated topi.cuda.scan implementation to follow buffer dimensions.

* Resolved breakage when flattening AllocateConst nodes.

* Resolved breakages from latest merge with main.

* Corrected error in merge.

* Use empty indices for rank-0 tensor.

* Added ir_builder workaround for 1-d indexing.

* Consistent buffer access type in LLVM codegen, to match C codegen

* StorageRewrite, update indices of modified buffers.

* Dynamic relay nodes, access 0-d tensors with 0-d indices.

* BFloat16 legalization, update buffer type.

* Updated meshgrid to use 0-d index for 0-d buffer.

* Corrected boolean handling in Allocate nodes.

* Added workaround to unpack 1-d Tensor indices into N-d buffer indices.

* Resolved a few more failures in relay tests on cuda.

* Resolve linting

* CI bump

* Updated renormalize_split_pattern tests to use BufferLoad/BufferStore

* Fixed cuda codegen checks for BufferStore/Ramp.

* Simplify indices further, needed to avoid cuda register limit.

* fixed dyn onehot shape func accessing 1d buffer with ()

* Fixed codegen indexing for int4 scalar types.

* Temporary workaround for incorrect constant folding.

Need to further investigate vectorized LLVM constants

* s/find_allocate_usage/FindAllocateUsage/g

* Added buffer type consistency TODO.

* Improved comment on address_of Op.

* Rename LegalizeDtype to LegalizeDType, made private.

* fix format and lint errors

* Disable vectorization of AllocateConst buffer in StorageRewrite.

* Pass buffer_map through to the PrimFunc in cmsisnn

* try disabling problematic winograd test case

* try different way of buffer mapping in storage_rewrite

* Removed unnecessary ramp node in ir_builder.


* Updated LLVM codegen for buffer indexing.

TVM data arrays are always densely packed.  If the LLVM type
corresponding to a vectorized TVM datatype contains padding for
alignment, the array location should be computed based on the
primitive element type.


Co-authored-by: Masahiro Masuda <masahi129@gmail.com>
Co-authored-by: adstraw <astraw@octoml.ai>
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
…gen (apache#10339)

* init

* upd

* upd

* lint

* lint again

* upd

* add m16n8k32 testcase

* format

* use make_tuple instead of initializer list

* add metadata offset

* upd

* docstring and sanity

* add u8s8s32 back

* improvement

* compatible apache#9727
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
A check for unsigned integer overflow would throw an error if it
encountered 0U - 0U.

apache#10484, which introduced the check,
and apache#9727, which introduced this
edge case, were in CI at the same time, and each was tested against a
merge candidate that did not include the other.  The unittest failure
only occurred when both PRs were merged.
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
…che#10520)

* [Hexagon] Resolve breakage in test_hexagon/test_cache_read_write

Breakage was caused by apache#9727, which
didn't account for the new `builtin::mem_copy()` when computing the
stack size in `StackSizeChecker`.

* Added comment indicating need for StackSizeChecker::MakeMemCopy.

* Updated unittests to run all contrib/test_hexagon at CI.

* CI bump

* Fix lint formatting error.

* Updated fix to remove StackSizeChecker entirely.

* Bugfix, verify the precheck's allocations, not own.

* Bugfix, pass context information to the precheck.
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
…nd add lowering for VTCM / Hexagon (apache#10558)

* repurpose texture flatten for vtcm; TIR lowering correct

* clean up remaining code in texture flatten pass

* add Alloc and FreeTexture, but failing to run over rpc

* test passing with malloc in the device api

* cleanup

* fails in very reliable way with memory corruption

* working with non-HexagonBuffer vtcm alloc

* cleanup

* do not pass scope through mem_copy api

* [Hexagon] Resolve breakage in test_hexagon/test_cache_read_write

Breakage was caused by apache#9727, which
didn't account for the new `builtin::mem_copy()` when computing the
stack size in `StackSizeChecker`.

* use HexagonBuffer in Alloc and Free packed funcs

* Added comment indicating need for StackSizeChecker::MakeMemCopy.

* add AllocVtcmWorkspace and FreeVtcmWorkspace

* cleanup

* Updated unittests to run all contrib/test_hexagon at CI.

* create separate vtcm alloc lowering pass and transform

* reset texture_flatten.cc

* comments

* CI bump

* Fix lint formatting error.

* Updated fix to remove StackSizeChecker entirely.

* pass device and type to device api

* Bugfix, verify the precheck's allocations, not own.

* Bugfix, pass context information to the precheck.

* pass order and shape to device api

* working

* fix up types and arg passing

* pass scope to device api

* common builtin for texture / vtcm

* add scope to freend api

* format and lint

* fixed missed format error

* restart ci

* fix test random value issue + code review feedback

* fix test hang

* restructure lower vtcm pass per code review feedback (option a)

* format error

* global.vtcm + tvm_stack_make_shape

Co-authored-by: Eric Lunderberg <elunderberg@octoml.ai>
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
…che#10657)

This resolves a bug introduced in
apache#9727, and adds a test to catch this
failure mode.  This bug occurred because StorageFlatten's visitor for
PrefetchNode inserted additional pre-flattened `BufferLoad` nodes
after visiting the body of the Prefetch, preventing those `BufferLoad`
nodes from being flattened.  Moving this visit to after the insertion
of the `BufferLoad` nodes allows the usual buffer flattening to apply
to the newly inserted nodes.
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
…pache#10787)

* [Pass][Bugfix] Disable re-use of non-flat buffers in StorageRewrite.

As a follow-up from apache#9727,
restricting StorageRewrite to only modify flat memory buffers.  When
rewriting, the existing algorithm in StorageRewrite flattens N-d
allocations into 1-d allocations, preventing them from being exposed
to the codegen.

* Bugfix, flattening of Allocate/AllocateConst extents

Previously, these were ignored entirely.  This worked so long as all
allocations were 1-d, as `StorageRewrite` erroneously flattened merged
arrays into 1-d.
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Apr 11, 2022
…pache#10787)

* [Pass][Bugfix] Disable re-use of non-flat buffers in StorageRewrite.

As a follow-up from apache#9727,
restricting StorageRewrite to only modify flat memory buffers.  When
rewriting, the existing algorithm in StorageRewrite flattens N-d
allocations into 1-d allocations, preventing them from being exposed
to the codegen.

* Bugfix, flattening of Allocate/AllocateConst extents

Previously, these were ignored entirely.  This worked so long as all
allocations were 1-d, as `StorageRewrite` erroneously flattened merged
arrays into 1-d.
Lunderberg added a commit to Lunderberg/tvm that referenced this pull request Apr 13, 2022
Prior to this commit, the BufferAllocationLocator mutator used in the
PlanAndUpdateBufferAllocationLocation pass would erroneously insert an
entry to `BlockNode::alloc_buffers` for buffers allocated using
`Allocate` or `AllocateConst` nodes.  This error was introduced in
apache#9727, which deprecated `Load` and
`Store` nodes, replacing them with `BufferLoad` and `BufferStore`
nodes.  As a result, BufferAllocationLocator identified these as
buffers whose allocations should be moved to inner loops, rather than
as unmanaged allocations that should be ignored.

This commit restores the earlier behavior by only operating on buffer
allocations in `BlockNode::alloc_buffers`, and explicitly ignoring any
buffers whose allocation is done with `Allocate` or `AllocateConst`.
vinx13 pushed a commit that referenced this pull request Apr 14, 2022
* [TIR] Ignore Allocate/AllocateConst in BufferAllocationLocator

Prior to this commit, the BufferAllocationLocator mutator used in the
PlanAndUpdateBufferAllocationLocation pass would erroneously insert an
entry to `BlockNode::alloc_buffers` for buffers allocated using
`Allocate` or `AllocateConst` nodes.  This error was introduced in
#9727, which deprecated `Load` and
`Store` nodes, replacing them with `BufferLoad` and `BufferStore`
nodes.  As a result, BufferAllocationLocator identified these as
buffers whose allocations should be moved to inner loops, rather than
as unmanaged allocations that should be ignored.

This commit restores the earlier behavior by only operating on buffer
allocations in `BlockNode::alloc_buffers`, and explicitly ignoring any
buffers whose allocation is done with `Allocate` or `AllocateConst`.

* Only inject opaque block if managed buffers exist.

Previously, all buffers found were managed buffers, so this check
wasn't needed.
Lucien0 pushed a commit to Lucien0/tvm that referenced this pull request Apr 19, 2022
Lunderberg added a commit to Lunderberg/tvm that referenced this pull request Apr 19, 2022
`PrimFuncNode::preflattened_buffer_map` was introduced in
apache#9727, in order to maintain a record
of the pre-flattened buffer shape until it can be used in
`MakePackedAPI`.  This commit instead maintains the pre-flattened
shapes in `PrimFuncNode::buffer_map`, while the body of the function
uses a flattened buffer alias.

Passes LLVM tests in test_target_codegen_llvm.py as initial proof of
concept.
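The split between the pre-flattened shape (kept at the function boundary) and the flattened alias (used in the body) can be illustrated with a small sketch (plain Python, hypothetical names; TVM's real bookkeeping lives in `PrimFuncNode::buffer_map`):

```python
# Hypothetical sketch: an N-d buffer shape recorded at the PrimFunc
# boundary, with the function body indexing a flattened 1-d alias.

def flatten_shape(shape):
    """Collapse an N-d shape into the 1-d extent used by the alias."""
    extent = 1
    for dim in shape:
        extent *= dim
    return [extent]

boundary_shape = [16, 64]                 # shape visible in buffer_map
alias_shape = flatten_shape(boundary_shape)  # shape used inside the body

def flat_index(i, j, shape):
    # Row-major flattening: A[i, j] in the 2-d view maps to
    # A_flat[i * shape[1] + j] in the aliased 1-d view.
    return i * shape[1] + j
```

With both shapes derivable from a single `buffer_map` entry plus the alias, the separate `preflattened_buffer_map` field becomes unnecessary.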
altanh pushed a commit to altanh/tvm that referenced this pull request Apr 28, 2022
Lunderberg added a commit to Lunderberg/tvm that referenced this pull request May 16, 2022
Lunderberg added a commit to Lunderberg/tvm that referenced this pull request May 19, 2022
- Enabled simplification of `A[i] = A[i] + 0` into a no-op.  This was a
  bug introduced in apache#9727, which
  applied this rewrite only to `A[i] = A[i]`, and not to statements
  which simplify to `A[i] = A[i]`.  A regression test was added to
  prevent recurrence of this bug.

- Enabled simplification of `x - x` to zero for floating point types.
  Previously, this simplification was applied only for data types that
  could be used as buffer indices.
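The widened no-op detection can be sketched with a toy rewriter (hypothetical tuple-based expressions, not TVM's arithmetic simplifier): the stored value is simplified first, then compared against a load of the same buffer element, so `A[i] = A[i] + 0` is caught alongside `A[i] = A[i]`.

```python
# Hypothetical sketch: expressions are tuples like ("add", x, 0);
# a load of A at index i is ("load", "A", "i").

def simplify_expr(expr):
    """Toy bottom-up rewriter covering the two rules described above."""
    if isinstance(expr, tuple) and expr[0] in ("add", "sub"):
        op, lhs, rhs = expr
        lhs, rhs = simplify_expr(lhs), simplify_expr(rhs)
        if op == "add" and rhs == 0:
            return lhs            # x + 0 -> x
        if op == "sub" and lhs == rhs:
            return 0              # x - x -> 0, for floats as well
        return (op, lhs, rhs)
    return expr

def store_is_noop(buffer, index, value):
    """True if the store writes back the value it would load."""
    return simplify_expr(value) == ("load", buffer, index)

# `A[i] = A[i] + 0` simplifies to `A[i] = A[i]`, a no-op:
noop = store_is_noop("A", "i", ("add", ("load", "A", "i"), 0))
```

The key ordering choice is to simplify before the no-op comparison; the pre-fix code compared the unsimplified store value, which is why `A[i] = A[i] + 0` slipped through.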
Lunderberg added a commit to Lunderberg/tvm that referenced this pull request May 25, 2022
vinx13 pushed a commit that referenced this pull request May 26, 2022
* [TIR] Additional Stmt/Expr simplification rules

* Updated to maintain separate int/float simplification paths

* Updated to use tvm.testing.main

* Remove duplicate rewrite rules
juda pushed a commit to juda/tvm that referenced this pull request Jun 21, 2022
Lunderberg added a commit that referenced this pull request Nov 16, 2022
`PrimFuncNode::preflattened_buffer_map` was introduced in
#9727, in order to maintain a record
of the pre-flattened buffer shape until it can be used in
`MakePackedAPI`.  This commit instead maintains the pre-flattened
shapes in `PrimFuncNode::buffer_map`, while the body of the function
uses a flattened buffer alias, as described in
[RFC#70](apache/tvm-rfcs#70)
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022