[Hexagon] Detect and flush "dirty" cache prior to DMA read with cache bypass enabled by adstraw · Pull Request #13844 · apache/tvm

adstraw · 2023-01-25T20:21:35Z

Add software cache management to enable DMA with cache bypass, an experimental feature that requires software to manage the cache. DMA with cache bypass assumes that HexagonBuffer objects are not cached unless explicitly modified by the primfunc. To uphold that assumption, this PR flushes and invalidates the cache after a HexagonBuffer is allocated with malloc or copied with memcpy. It also adds logic to flush and invalidate the cache before a DMA with cache bypass enabled when the buffer in question has been modified by the primfunc. The test_matmul.py test hits this case by performing layout transforms in global address space ahead of a DMA. CC @csullivan with thanks for providing the test in question.
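To illustrate the "flush only when the primfunc modified the buffer" idea, here is a minimal sketch. It is not the PR's implementation: `DirtyBufferTracker`, `MarkModified`, and `FlushIfDirty` are hypothetical names, and the real flush call is only indicated by a comment.

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical tracker for illustration only; these are not TVM identifiers.
class DirtyBufferTracker {
 public:
  // Record that the primfunc stored to this buffer.
  void MarkModified(const void* buffer) { dirty_.insert(buffer); }

  // Call before a DMA read with cache bypass. Returns true when the
  // buffer was dirty and therefore had to be flushed first.
  bool FlushIfDirty(const void* buffer) {
    if (dirty_.erase(buffer) == 0) return false;  // untouched: nothing to do
    // A real runtime would issue the cache flush for `buffer` here.
    ++flush_count_;
    return true;
  }

  int flush_count() const { return flush_count_; }

 private:
  std::unordered_set<const void*> dirty_;  // buffers written since last flush
  int flush_count_ = 0;                    // number of flushes requested
};
```

Under this sketch, a buffer that was never written by the primfunc costs no cache operation before the DMA, and a buffer that was written is flushed exactly once until it is modified again.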

tvm-bot · 2023-01-25T20:21:38Z

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

cc @ibsidorenko (See #10317 for details)

Generated by tvm-bot

Co-authored-by: Chris Sullivan <csullivan@octoml.ai>

janetsc · 2023-01-26T18:11:11Z

+                         for_loop->extent * bufferloadnode->dtype.bytes(), dma_bypass_cache_}));
+
+      // if the buffer we are about to DMA was modified by the primfunc
+      // then we need to flush the buffer from the cache prior to the DMA


This is great! Nice general solution for all primfuncs. Can we detect the directionality of a buffer? Going to VTCM we need a flush on src, coming back we need an invalidate on dst.

tmoreau89 · 2023-01-28

Agreed that it's probably better to perform an invalidation vs. flush depending on the directionality of the data transfer:

Upon DMA “read”, you have to flush before

Upon DMA “write”, you have to invalidate after

A hopefully helpful example is what was done for VTA when it was implemented with non-coherent DMA:

tvm/vta/runtime/runtime.cc

Lines 1319 to 1329 in bf0607b

if (from_buffer) {
  // This is an FPGA to host mem transfer
  from_buffer->InvalidateCache(from_offset, size);
  from_buffer->MemCopyToHost(static_cast<char*>(to) + to_offset,
                             static_cast<const char*>(from) + from_offset, size);
} else if (to_buffer) {
  // This is a host to FPGA mem transfer
  to_buffer->MemCopyFromHost(static_cast<char*>(to) + to_offset,
                             static_cast<const char*>(from) + from_offset, size);
  to_buffer->FlushCache(to_offset, size);
}
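The direction rule discussed above (flush the source when going to VTCM, invalidate the destination when coming back) could be captured in a small helper. This is only a sketch: `DmaDirection`, `CacheOp`, `Target`, and `RequiredCacheAction` are hypothetical names and do not exist in TVM or QuRT.

```cpp
#include <cassert>

// Hypothetical types for illustration; not TVM or QuRT identifiers.
enum class DmaDirection { kDdrToVtcm, kVtcmToDdr };
enum class CacheOp { kFlush, kInvalidate };
enum class Target { kSrc, kDst };

struct CacheAction {
  CacheOp op;     // which cache maintenance operation is needed
  Target target;  // which side of the copy lives in cached DDR
};

// Pick the cache operation from the transfer direction:
//  - DDR to VTCM: the DMA reads DDR, so dirty lines for src must be flushed.
//  - VTCM to DDR: the DMA writes DDR, so stale lines for dst must be invalidated.
inline CacheAction RequiredCacheAction(DmaDirection direction) {
  if (direction == DmaDirection::kDdrToVtcm) {
    return CacheAction{CacheOp::kFlush, Target::kSrc};
  }
  return CacheAction{CacheOp::kInvalidate, Target::kDst};
}
```

In both cases the operation applies to the DDR side of the transfer, since that is the only side that goes through the cache.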

adstraw · 2023-01-31

@tmoreau89 and @janetsc I think this is really good feedback, but I am a little leery of making changes without a failing unit test to drive the development, as test_matmul.py did in this PR. I imagine that software cache management to enable DMA bypass on Hexagon will be an iterative process, and it seems like you are pointing to the next iteration based on the VTA example. I would like to let this PR move through on its own merit and then address follow-on cases, if possible. Thoughts?

janetsc · 2023-01-31

So the next PR could have:

Unit tests that expose the root of the problem (if they don't, we should go back to the drawing board)

Additions to the matmul test that expose the need to invalidate in the VTCM to DDR direction

Changes to insert flush in the right place and invalidate in the right place.

The main reason I was advocating to do both directions now is that I'd like to confirm that those two unit tests do indeed fail. If they don't, we'll need to do more experiments to get to the root of it. The unit tests (from our offline discussion):

DDR to VTCM
Allocate two buffers in DDR and VTCM
Write "0xbeefbeef" to the DDR buffer. Flush.
Write "0xfaceface" to the DDR buffer. Do NOT flush.
DMA from DDR to VTCM with BYPASS ON.
Compare DDR and VTCM buffers. They should not match, because the real values you wanted in DDR weren't flushed to DDR.

VTCM to DDR
Allocate two buffers in DDR and VTCM
Write "0xbeefbeef" to the DDR buffer. (You can flush or not - it actually won't matter.)
Write "0xfaceface" to the VTCM buffer.
DMA from VTCM to DDR with BYPASS ON.
Compare DDR and VTCM buffers. They will not match because there were stale values in the L2 cache for the DDR buffer, so it won't pick up what you wrote with BYPASS ON. DDR needed to be invalidated first.
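The two scenarios above can be walked through on a toy model. This sketch assumes a single DDR word behind a write-back cache and a single uncached VTCM word. `ToySystem` and the scenario functions are hypothetical and only mimic the expected behavior; they do not exercise Hexagon hardware or the TVM runtime. Placing the invalidate after the DMA is also a choice made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Toy model: one DDR word behind a write-back cache, one uncached VTCM word.
struct ToySystem {
  std::uint32_t ddr = 0;    // backing memory
  std::uint32_t vtcm = 0;   // scratchpad, never cached
  std::uint32_t line = 0;   // cached copy of the DDR word
  bool valid = false;       // the cache holds a copy of the DDR word
  bool dirty = false;       // the cached copy is newer than DDR

  // CPU store to DDR lands in the cache only (write-back).
  void CpuWriteDdr(std::uint32_t v) { line = v; valid = true; dirty = true; }
  // CPU load from DDR hits the cache when a valid line is present.
  std::uint32_t CpuReadDdr() {
    if (!valid) { line = ddr; valid = true; }
    return line;
  }
  // Flush: write dirty data back to DDR; the line stays valid.
  void Flush() { if (dirty) { ddr = line; dirty = false; } }
  // Invalidate: drop the line without writing it back.
  void Invalidate() { valid = false; dirty = false; }
  // DMA with bypass touches DDR directly and never consults the cache.
  void DmaDdrToVtcm() { vtcm = ddr; }
  void DmaVtcmToDdr() { ddr = vtcm; }
};

// DDR to VTCM: returns the value that ends up in VTCM.
std::uint32_t DdrToVtcmScenario(bool flush_before_dma) {
  ToySystem s;
  s.CpuWriteDdr(0xbeefbeef);
  s.Flush();
  s.CpuWriteDdr(0xfaceface);  // stays in the cache unless flushed
  if (flush_before_dma) s.Flush();
  s.DmaDdrToVtcm();
  return s.vtcm;
}

// VTCM to DDR: returns what the CPU reads back from DDR.
std::uint32_t VtcmToDdrScenario(bool invalidate_dst) {
  ToySystem s;
  s.CpuWriteDdr(0xbeefbeef);  // leaves a cached line for the DDR word
  s.vtcm = 0xfaceface;
  s.DmaVtcmToDdr();
  if (invalidate_dst) s.Invalidate();
  return s.CpuReadDdr();
}
```

In this model, `DdrToVtcmScenario(false)` leaves the old value in VTCM and `VtcmToDdrScenario(false)` reads a stale cached value, which is the mismatch both proposed unit tests are meant to expose.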

All that said, with a flush-only operation for DMA, I'll sign off on this as a good incremental step. Thanks!

tmoreau89 · 2023-01-31

Definitely fine with getting this PR merged first before we plan to make broader changes. Janet's proposed plan seems quite sound.

janetsc · 2023-01-26T19:13:01Z

-    qurt_mem_cache_clean(reinterpret_cast<qurt_addr_t>(copy.src), copy.num_bytes,
-                         QURT_MEM_CACHE_INVALIDATE, QURT_MEM_DCACHE);


When copying to Hexagon, did we verify that a flat src buffer is always passed as a parameter to FastRPC? It could be the object pointer (which has several allocations).

janetsc · 2023-01-30

Some clarifications (from our offline discussion, saving here for posterity):

I was wondering what gets passed in the FastRPC call. If it is a pointer to the allocation and not a pointer to the HexagonBuffer object, then we shouldn't need any of these cache operations surrounding the memcpy, because (per Karl's comment) FastRPC calls have code that makes sure buffers passed as arguments are coherent.

It sounds like DDR is going to be just one allocation in that object. And we do have the map of allocations to buffer objects, so that makes me think we can get rid of everything except a flush on dst when copying to the device, after the memcpy operation.

The reason that is needed is just in case there is no primfunc modifying that data before a DMA to VTCM. In that case, we want to make sure it starts out flushed to DDR.

Co-authored-by: Janet Schneider <janetsc@octoml.ai>

janetsc

Thanks - looks good!

tmoreau89

I like the changes you've applied inside of hexagon_buffer_copy_across_regions. This one is good to go.

adstraw added 2 commits January 25, 2023 13:12

[Hexagon] Software cache management for DMA with cache bypass

70ec6d4

Add test_matmul.py

2d66ffd

Co-authored-by: Chris Sullivan <csullivan@octoml.ai>

adstraw force-pushed the straw-hex-dma-bypass-coherency branch from 104ccd4 to 2d66ffd on January 25, 2023 21:19

fix pylint errors

adbe651

janetsc reviewed Jan 26, 2023

Comment thread src/runtime/hexagon/hexagon_buffer.cc Outdated

janetsc reviewed Jan 26, 2023

Comment thread src/runtime/hexagon/hexagon_buffer.cc Outdated

janetsc reviewed Jan 26, 2023

adstraw added 3 commits January 26, 2023 11:44

move CHECK closer to malloc; makes unit tests pass

60414f8

invalidate (only) after malloc

d97d3b5

invalidate src on copy to external; always flush dest

cb91f8c

Co-authored-by: Janet Schneider <janetsc@octoml.ai>

adstraw force-pushed the straw-hex-dma-bypass-coherency branch from da81487 to cb91f8c on January 31, 2023 00:33

adstraw requested review from janetsc and tmoreau89 and removed request for janetsc and tmoreau89 January 31, 2023 15:44

adstraw changed the title ~~[Hexagon] Software cache management for DMA with cache bypass~~ [Hexagon] Detect and flush "dirty" cache prior to DMA read with cache bypass enabled Jan 31, 2023

cache_flush api to use flush instead of flush and invalidate

2aa687d

janetsc approved these changes Jan 31, 2023

tmoreau89 approved these changes Jan 31, 2023

adstraw closed this Jan 31, 2023

adstraw wants to merge 7 commits into apache:main from adstraw:straw-hex-dma-bypass-coherency
