Skip to content

[Hexagon] Detect and flush "dirty" cache prior to DMA read with cache bypass enabled#13844

Closed
adstraw wants to merge 7 commits intoapache:mainfrom
adstraw:straw-hex-dma-bypass-coherency
Closed

[Hexagon] Detect and flush "dirty" cache prior to DMA read with cache bypass enabled#13844
adstraw wants to merge 7 commits intoapache:mainfrom
adstraw:straw-hex-dma-bypass-coherency

Conversation

@adstraw
Copy link
Copy Markdown
Contributor

@adstraw adstraw commented Jan 25, 2023

Add software cache management to enable DMA with cache bypass enabled. DMA with cache bypass is an experimental feature requiring software management of the cache. DMA with cache bypass enabled assumes that HexagonBuffer objects are not cached unless explicitly modified by the primfunc. This PR adds cache flush and invalidation after HexagonBuffer allocation with malloc or copy with memcpy to uphold that assumption. In addition, this PR adds logic to flush and invalidate the cache prior to a DMA with cache bypass enabled when the buffer in question has been modified by the primfunc. The test_matmul.py test hits this case by performing layout transforms in global address space ahead of a DMA. CC @csullivan with thanks for providing the test in question.

@tvm-bot
Copy link
Copy Markdown
Collaborator

tvm-bot commented Jan 25, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@adstraw adstraw force-pushed the straw-hex-dma-bypass-coherency branch from 104ccd4 to 2d66ffd Compare January 25, 2023 21:19
Comment thread src/runtime/hexagon/hexagon_buffer.cc Outdated
Comment thread src/runtime/hexagon/hexagon_buffer.cc Outdated
for_loop->extent * bufferloadnode->dtype.bytes(), dma_bypass_cache_}));

// if the buffer we are about to DMA was modified by the primfunc
// then we need to flush the buffer from the cache prior to the DMA
Copy link
Copy Markdown
Contributor

@janetsc janetsc Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Nice general solution for all primfuncs. Can we detect the directionality of a buffer? Going to VTCM we need a flush on src, coming back we need an invalidate on dst.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that it's probably better to perform an invalidation vs. flush depending on the directionality of the data transfer:

  • Upon DMA “read”, you have to flush before

  • Upon DMA “write”, you have to invalidate after

A helpful example hopefully could be what was done for VTA when VTA would be implemented with non-coherent DMA, as in here:

tvm/vta/runtime/runtime.cc

Lines 1319 to 1329 in bf0607b

if (from_buffer) {
// This is an FPGA to host mem transfer
from_buffer->InvalidateCache(from_offset, size);
from_buffer->MemCopyToHost(static_cast<char*>(to) + to_offset,
static_cast<const char*>(from) + from_offset, size);
} else if (to_buffer) {
// This is a host to FPGA mem transfer
to_buffer->MemCopyFromHost(static_cast<char*>(to) + to_offset,
static_cast<const char*>(from) + from_offset, size);
to_buffer->FlushCache(to_offset, size);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmoreau89 and @janetsc I think this is really good feedback but I am a little leery to make changes without a failing unit test to use for test driven development as with test_matmul.py in this PR. I imagine that software cache management to enable DMA bypass on Hexagon will be an iterative process. It seems like you are pointing to the next iteration based on VTA example. I would like to let this PR move through on its own merit and then address follow on cases, if possible. Thoughts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the next PR could have:

  1. Unit tests that expose the root of the problem (if they don't, we should go back to the drawing board)
  2. Additions to the matmul test that expose the need to invalidate in the VTCM to DDR direction
  3. Changes to insert flush in the right place and invalidate in the right place.

The main reason I was advocating to do both directions now is that I'd like to confirm that those two unit tests do indeed fail. If they don't, we'll need to do more experiments to get to the root of it. The unit tests (from our offline discussion):

DDR to VTCM
Allocate two buffers in DDR and VTCM
Write "0xbeefbeef" to the DDR buffer. Flush.
Write "0xfaceface" to the DDR buffer. Do NOT flush.
DMA from DDR to VTCM with BYPASS ON.
Compare DDR and VTCM buffers. They should not match, because the real values you wanted in DDR weren't flushed to DDR.

VTCM to DDR
Allocate two buffers in DDR and VTCM
Write "0xbeefbeef" to the DDR buffer. (You can flush or not - it actually won't matter.)
Write "0xfaceface" to the VTCM buffer.
DMA from VTCM to DDR with BYPASS ON.
Compare DDR and VTCM buffers. They will not match because there were stale values in the L2 cache for the DDR buffer, so it won't pick up what you wrote with BYPASS ON. DDR needed to be invalidated first.

All that said, with a flush-only operation for DMA, I'll sign off on this as a good incremental step. Thanks!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely fine with getting this PR merged first before we plan to make broader changes. Janet's proposed plan seems quite sound.

Comment on lines -244 to -245
qurt_mem_cache_clean(reinterpret_cast<qurt_addr_t>(copy.src), copy.num_bytes,
QURT_MEM_CACHE_INVALIDATE, QURT_MEM_DCACHE);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When copying to Hexagon, did we verify that a flat src buffer always passed as a parameter to FastRPC? It could be the object pointer (which has several allocations).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some clarifications (from our offline discussion, saving here for posterity):

I was wondering what gets passed in the fast RPC call. If it is a pointer to the allocation and not a pointer to the hexagon buffer object, then we shouldn't need any of these cache operations surrounding the memcpy. (Because of Karl's comment that those have code that will make sure buffers passed as arguments are coherent.)

It sounds like DDR is going to be just one allocation in that object. And we do have the map of allocations to buffer objects, so that makes me think we can get rid of everything except a flush on dst when copying to the device, after the memcpy operation.

The reason that is needed is just in case there is no primfunc modifying that data before a DMA to VTCM. In that case, we want to make sure it starts out flushed to DDR.

@adstraw adstraw force-pushed the straw-hex-dma-bypass-coherency branch from da81487 to cb91f8c Compare January 31, 2023 00:33
@adstraw adstraw requested review from janetsc and tmoreau89 and removed request for janetsc and tmoreau89 January 31, 2023 15:44
@adstraw adstraw changed the title [Hexagon] Software cache management for DMA with cache bypass [Hexagon] Detect and flush "dirty" cache prior to DMA read with cache bypass enabled Jan 31, 2023
Copy link
Copy Markdown
Contributor

@janetsc janetsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - looks good!

Copy link
Copy Markdown
Contributor

@tmoreau89 tmoreau89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the changes you've applied inside of hexagon_buffer_copy_across_regions. This one is good to go.

@adstraw adstraw closed this Jan 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants