[Hexagon] Detect and flush "dirty" cache prior to DMA read with cache bypass enabled #13844
adstraw wants to merge 7 commits into apache:main
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
Co-authored-by: Chris Sullivan <csullivan@octoml.ai>
104ccd4 to 2d66ffd
for_loop->extent * bufferloadnode->dtype.bytes(), dma_bypass_cache_}));

// if the buffer we are about to DMA was modified by the primfunc
// then we need to flush the buffer from the cache prior to the DMA
This is great! Nice general solution for all primfuncs. Can we detect the directionality of a buffer? Going to VTCM we need a flush on src, coming back we need an invalidate on dst.
Agreed that it's probably better to perform an invalidation vs. flush depending on the directionality of the data transfer:
- Upon DMA "read", you have to flush before
- Upon DMA "write", you have to invalidate after
A helpful example might be what was done for VTA when it was implemented with non-coherent DMA, as in here:
Lines 1319 to 1329 in bf0607b
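The directionality rule above can be written down as a small lookup. This is only a sketch: every name in it (`DmaDirection`, `CacheMaintenance`, `RequiredMaintenance`) is hypothetical and none of it is TVM or QuRT API. It assumes DDR is the cached side of the transfer and that VTCM needs no cache maintenance.

```cpp
// Hypothetical names throughout; this is not TVM or QuRT API.
enum class DmaDirection { kDdrToVtcm, kVtcmToDdr };
enum class CacheOp { kFlush, kInvalidate };
enum class Target { kSrc, kDst };
enum class Timing { kBeforeDma, kAfterDma };

struct CacheMaintenance {
  CacheOp op;      // which cache operation to issue
  Target target;   // which side of the copy it applies to
  Timing timing;   // when to issue it relative to the DMA
};

// Assumption: DDR is cached and VTCM is not, so only the DDR side
// of the transfer ever needs maintenance.
constexpr CacheMaintenance RequiredMaintenance(DmaDirection dir) {
  return dir == DmaDirection::kDdrToVtcm
             // DMA reads DDR: dirty lines must reach DDR first.
             ? CacheMaintenance{CacheOp::kFlush, Target::kSrc, Timing::kBeforeDma}
             // DMA writes DDR: cached copies of dst are now stale.
             : CacheMaintenance{CacheOp::kInvalidate, Target::kDst, Timing::kAfterDma};
}
```

A lowering pass or runtime copy routine could consult a table like this once per DMA instead of hard-coding a flush for every transfer.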
@tmoreau89 and @janetsc I think this is really good feedback, but I am a little leery of making changes without a failing unit test to use for test-driven development, as with test_matmul.py in this PR. I imagine that software cache management to enable DMA bypass on Hexagon will be an iterative process. It seems like you are pointing to the next iteration based on the VTA example. I would like to let this PR move through on its own merit and then address follow-on cases, if possible. Thoughts?
So the next PR could have:
- Unit tests that expose the root of the problem (if they don't, we should go back to the drawing board)
- Additions to the matmul test that expose the need to invalidate in the VTCM to DDR direction
- Changes to insert flush in the right place and invalidate in the right place.
The main reason I was advocating to do both directions now is that I'd like to confirm that those two unit tests do indeed fail. If they don't, we'll need to do more experiments to get to the root of it. The unit tests (from our offline discussion):
DDR to VTCM:
- Allocate two buffers, one in DDR and one in VTCM.
- Write "0xbeefbeef" to the DDR buffer. Flush.
- Write "0xfaceface" to the DDR buffer. Do NOT flush.
- DMA from DDR to VTCM with BYPASS ON.
- Compare DDR and VTCM buffers. They should not match, because the real values you wanted in DDR weren't flushed to DDR.
VTCM to DDR:
- Allocate two buffers, one in DDR and one in VTCM.
- Write "0xbeefbeef" to the DDR buffer. (You can flush or not - it actually won't matter.)
- Write "0xfaceface" to the VTCM buffer.
- DMA from VTCM to DDR with BYPASS ON.
- Compare DDR and VTCM buffers. They will not match because there were stale values in the L2 cache for the DDR buffer, so it won't pick up what you wrote with BYPASS ON. DDR needed to be invalidated first.
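The two unit tests described above can be sketched against a toy model. This is an illustration only, not Hexagon's real L2 behavior and not the QuRT API: `ToyMemory` is a hypothetical write-back cache tracked one 32-bit word at a time, its `Flush` is assumed to write dirty words back while keeping them cached, and the DMA helpers simply bypass that cache.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy write-back cache in front of DDR (hypothetical, for illustration).
struct ToyMemory {
  struct Line { uint32_t value; bool dirty; };
  std::vector<uint32_t> ddr;              // backing store
  std::unordered_map<size_t, Line> cache; // cached words

  explicit ToyMemory(size_t words) : ddr(words, 0) {}

  // CPU store lands in the cache only and marks the word dirty.
  void CpuWrite(size_t i, uint32_t v) { assert(i < ddr.size()); cache[i] = {v, true}; }
  // CPU load: a hit returns the cached value, a miss fills from DDR.
  uint32_t CpuRead(size_t i) {
    assert(i < ddr.size());
    auto it = cache.find(i);
    if (it == cache.end()) it = cache.emplace(i, Line{ddr[i], false}).first;
    return it->second.value;
  }
  // Flush: write dirty words back to DDR; the words stay cached but clean.
  void Flush() {
    for (auto& kv : cache) {
      if (kv.second.dirty) { ddr[kv.first] = kv.second.value; kv.second.dirty = false; }
    }
  }
  // Invalidate: drop all cached words without writing them back.
  void Invalidate() { cache.clear(); }
};

// DMA with cache bypass: touches DDR directly and never consults the cache.
void DmaDdrToVtcm(const ToyMemory& m, std::vector<uint32_t>& vtcm) {
  assert(vtcm.size() == m.ddr.size());
  vtcm = m.ddr;
}
void DmaVtcmToDdr(const std::vector<uint32_t>& vtcm, ToyMemory& m) {
  assert(vtcm.size() == m.ddr.size());
  m.ddr = vtcm;
}

// Compare what the CPU sees for the DDR buffer against the VTCM contents.
bool CpuViewEquals(ToyMemory& m, const std::vector<uint32_t>& vtcm) {
  for (size_t i = 0; i < vtcm.size(); ++i) {
    if (m.CpuRead(i) != vtcm[i]) return false;
  }
  return true;
}

// DDR to VTCM test; returns whether the buffers match at the end.
bool DdrToVtcmMatches(bool flush_before_dma) {
  constexpr size_t kWords = 4;
  ToyMemory mem(kWords);
  std::vector<uint32_t> vtcm(kWords, 0);
  for (size_t i = 0; i < kWords; ++i) mem.CpuWrite(i, 0xbeefbeefu);
  mem.Flush();
  for (size_t i = 0; i < kWords; ++i) mem.CpuWrite(i, 0xfacefaceu);  // not flushed
  if (flush_before_dma) mem.Flush();
  DmaDdrToVtcm(mem, vtcm);
  return CpuViewEquals(mem, vtcm);
}

// VTCM to DDR test; returns whether the buffers match at the end.
bool VtcmToDdrMatches(bool invalidate_dst) {
  constexpr size_t kWords = 4;
  ToyMemory mem(kWords);
  for (size_t i = 0; i < kWords; ++i) mem.CpuWrite(i, 0xbeefbeefu);
  std::vector<uint32_t> vtcm(kWords, 0xfacefaceu);
  DmaVtcmToDdr(vtcm, mem);
  if (invalidate_dst) mem.Invalidate();
  return CpuViewEquals(mem, vtcm);
}
```

In this model `DdrToVtcmMatches(false)` and `VtcmToDdrMatches(false)` both report a mismatch, and adding the flush or the invalidate makes the buffers agree. Whether the real hardware fails the same way is exactly what the proposed tests would have to confirm.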
All that said, with a flush-only operation for DMA, I'll sign off on this as a good incremental step. Thanks!
Definitely fine with getting this PR merged first before we plan to make broader changes. Janet's proposed plan seems quite sound.
qurt_mem_cache_clean(reinterpret_cast<qurt_addr_t>(copy.src), copy.num_bytes,
                     QURT_MEM_CACHE_INVALIDATE, QURT_MEM_DCACHE);
When copying to Hexagon, did we verify that a flat src buffer is always passed as a parameter to FastRPC? It could be the object pointer (which has several allocations).
Some clarifications (from our offline discussion, saving here for posterity):
I was wondering what gets passed in the FastRPC call. If it is a pointer to the allocation and not a pointer to the Hexagon buffer object, then we shouldn't need any of these cache operations surrounding the memcpy. (This follows from Karl's comment that FastRPC calls have code that will make sure buffers passed as arguments are coherent.)
It sounds like DDR is going to be just one allocation in that object. And we do have the map of allocations to buffer objects, so that makes me think we can get rid of everything except a flush on dst when copying to the device, after the memcpy operation.
The reason that is needed is just in case there is no primfunc modifying that data before a DMA to VTCM. In that case, we want to make sure it starts out flushed to DDR.
Co-authored-by: Janet Schneider <janetsc@octoml.ai>
da81487 to cb91f8c
tmoreau89 left a comment
I like the changes you've applied inside of hexagon_buffer_copy_across_regions. This one is good to go.
Add software cache management to enable DMA with cache bypass enabled. DMA with cache bypass is an experimental feature requiring software management of the cache. DMA with cache bypass enabled assumes that HexagonBuffer objects are not cached unless explicitly modified by the primfunc. This PR adds cache flush and invalidation after HexagonBuffer allocation with malloc or copy with memcpy to uphold that assumption. In addition, this PR adds logic to flush and invalidate the cache prior to a DMA with cache bypass enabled when the buffer in question has been modified by the primfunc. The test_matmul.py test hits this case by performing layout transforms in global address space ahead of a DMA. CC @csullivan with thanks for providing the test in question.
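The "flush only when the primfunc modified the buffer" logic in the description can be pictured with a small bookkeeping sketch. It is a simplification under stated assumptions, not the PR's implementation: `DmaPlanner` and its methods are hypothetical, buffers are identified by name, and the cache operation is recorded as a string instead of being issued.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical planner: records the operations it would emit, in order.
class DmaPlanner {
 public:
  explicit DmaPlanner(bool bypass_cache) : bypass_cache_(bypass_cache) {}

  // The primfunc writes to a buffer, so the buffer may now be dirty in cache.
  void Store(const std::string& buffer) {
    modified_.insert(buffer);
    ops_.push_back("store " + buffer);
  }

  // A DMA that reads from src. With bypass enabled, a buffer modified since
  // its last flush gets a flush first; unmodified buffers are assumed to be
  // uncached already and need nothing.
  void DmaFrom(const std::string& src) {
    if (bypass_cache_ && modified_.erase(src) > 0) {
      ops_.push_back("flush " + src);
    }
    ops_.push_back("dma " + src);
  }

  const std::vector<std::string>& ops() const { return ops_; }

 private:
  bool bypass_cache_;
  std::unordered_set<std::string> modified_;  // buffers written since last flush
  std::vector<std::string> ops_;
};
```

For a planner built with bypass enabled, the sequence `DmaFrom("a")`, `Store("b")`, `DmaFrom("b")`, `DmaFrom("b")` records a single flush of `b`, placed just before its first DMA. That mirrors the test_matmul.py case, where a layout transform writes the buffer ahead of the DMA.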