Worker-to-Self Channel Example #654

Open
hunhoffe opened this issue Jul 8, 2024 · 5 comments

Comments

@hunhoffe
Collaborator

hunhoffe commented Jul 8, 2024

I am working on a worker-to-worker data transfer example for channels (as part of the group of examples that exercise various features of channels, #648).

Draft PR is here: #653

I am basing it on the code in the channel_size example (PR awaiting merge: #642).

The channel_size example works well for me. As an intermediate step toward adding worker-to-worker communication to that example, I tried having each worker send data to itself over a channel. That version of the code is what is pushed in draft PR #653. The particular file of interest is this one.

When I run with this intermediate step, I get the following error:

Using aiecc.py from:  /scratch/ehunhoff/mlir-air/mlir-aie/install/bin/..
Running: builtin.module(air-insert-launch-and-segment-around-herd,func.func(air-lower-herd-parallel),air-dma-to-channel,canonicalize,cse,air-specialize-channel-wrap-and-stride,func.func(air-renumber-dma),func.func(convert-linalg-to-loops),air-place-herds{num-rows=6 num-cols=4 row-anchor=2 col-anchor=0})
Running: builtin.module(air-to-aie{emit-while-loop=false row-offset=2 col-offset=0 device=npu1_4col})
python3: /scratch/ehunhoff/mlir-air/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:956: void xilinx::air::simpleDMAChannelAllocation(std::vector<MemcpyBundleAsFlow> &, ShimDMAAllocator &, MemTileDMAAllocator &, TileDMAAllocator &): Assertion `core' failed.
Aborted (core dumped)
make: *** [Makefile:7: run] Error 134

My questions are:

  • Is a worker allowed to put data into, and get data from, a channel that connects it to itself?
  • Or is this a bug (either in my example code or in the AIR compiler)?
@hunhoffe hunhoffe changed the title from "Worker-to-Worker Channel Example" to "Worker-to-Self Channel Example" Jul 8, 2024
@hunhoffe hunhoffe mentioned this issue Jul 8, 2024
@erwei-xilinx
Collaborator

Thanks for investigating the worker-to-worker communication scenario. This scenario hasn't been exercised yet, so a bug in the AIR compiler is possible. simpleDMAChannelAllocation is implemented with naive logic that allocates DMA channels to data movements whose endpoints are in different memory spaces; in your case both endpoints are in L1, which might lead to unexpected behaviour.
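To make the assumption concrete, here is a toy Python model of an allocator that, like the logic described above, expects every data movement to have one L1 (core-side) endpoint and one endpoint in a different, outer memory space. It is NOT the C++ code in AIRToAIESchedulingUtils.cpp; `MemSpace`, `Flow`, `naive_allocate`, and the DMA labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class MemSpace(Enum):
    L3 = "L3"  # external memory (toy label)
    L2 = "L2"  # memtile memory (toy label)
    L1 = "L1"  # core-local memory of a worker (toy label)


@dataclass(frozen=True)
class Flow:
    """Hypothetical data movement from src to dst."""
    src: MemSpace
    dst: MemSpace


# Hypothetical mapping from the outer endpoint's memory space to a DMA kind.
OUTER_ALLOCATOR = {MemSpace.L3: "shim DMA", MemSpace.L2: "memtile DMA"}


def naive_allocate(flow: Flow) -> tuple[str, str]:
    """Return (outer DMA kind, inner DMA kind) for a flow.

    Naive assumption: exactly one endpoint is L1 (the worker side) and
    the other endpoint is in a different, outer memory space.
    """
    if MemSpace.L1 not in (flow.src, flow.dst):
        raise AssertionError("expected one L1 (core-side) endpoint")
    outer = [s for s in (flow.src, flow.dst) if s is not MemSpace.L1]
    if len(outer) != 1:
        # Both endpoints are L1: there is no outer endpoint to allocate
        # for, so the naive assumption breaks down here.
        raise AssertionError("flow endpoints share the L1 memory space")
    return OUTER_ALLOCATOR[outer[0]], "tile DMA"


if __name__ == "__main__":
    print(naive_allocate(Flow(MemSpace.L2, MemSpace.L1)))
    try:
        naive_allocate(Flow(MemSpace.L1, MemSpace.L1))
    except AssertionError as err:
        print("assertion:", err)
```

In this toy, an L2->L1 flow maps cleanly to a pair of DMA kinds, while an L1->L1 flow trips an assertion because the code never considered that case, which is the general shape of the failure reported above.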

@erwei-xilinx
Collaborator

This assumption (that the two endpoints of a data movement are in different memory spaces) could also affect other methods and passes in the AIR compiler.

@hunhoffe
Collaborator Author

hunhoffe commented Jul 8, 2024

Ah, I see!

As a path forward, would it make sense to:

  • Create a channel-to-self example with two sub-examples:
    • L1->L1 (mark this as failing for now, and leave an issue open to address it later)
    • L2->L1 (this should presumably succeed)
  • Go ahead and ignore this intermediate error and continue with my plans for a worker-to-worker example, since L1 on one worker is a different memory space than L1 on a different worker?
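The two proposed sub-examples could be tracked with expected outcomes in the style of lit's result codes (PASS, FAIL, XFAIL, XPASS), so the L1->L1 case is reported as an expected failure until it is supported and flagged if it starts passing. This is a minimal hypothetical harness sketch; the case names and the `run_case` callable stand in for building and running each variant.

```python
from typing import Callable

# Hypothetical sub-examples of a channel-to-self example: (name, expect_pass)
CASES = [
    ("l1_to_l1", False),  # known failure for now; tracked in an open issue
    ("l2_to_l1", True),   # expected to build and run
]


def classify(passed: bool, expect_pass: bool) -> str:
    """Map an actual result and its expectation to a lit-style status."""
    if expect_pass:
        return "PASS" if passed else "FAIL"
    # An expected failure that passes is an "unexpected pass".
    return "XPASS" if passed else "XFAIL"


def run_all(run_case: Callable[[str], bool]) -> dict[str, str]:
    """Run every case with the supplied runner and report its status."""
    return {name: classify(run_case(name), expect) for name, expect in CASES}
```

For example, `run_all(lambda name: name != "l1_to_l1")` reports the L1->L1 case as XFAIL and the L2->L1 case as PASS; once L1->L1 support lands, that case would show up as XPASS, signalling that its expectation should be flipped.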

@hunhoffe
Copy link
Collaborator Author

hunhoffe commented Jul 9, 2024

An update on this plan:

I will leave this issue open until the L1-to-L1 worker-to-self example is supported (and its lit test passes), or until it is decided that this is definitely an illegal operation, which the verifier can then catch (ideally with a helpful error message).
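If worker-to-self L1-to-L1 transfers are ruled illegal, the check could report a readable diagnostic instead of hitting an assertion deep in DMA allocation. The Python sketch below models such a check; `Endpoint`, `verify_channel`, the channel names, and the message wording are all hypothetical, and it deliberately leaves worker-to-worker L1-to-L1 unflagged because that case is still undecided in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of one side of a channel put/get."""
    memory_space: str                          # "L1", "L2", or "L3"
    worker: Optional[tuple[int, int]] = None   # herd (x, y) coordinate for L1


def verify_channel(name: str, put: Endpoint, get: Endpoint) -> Optional[str]:
    """Return a diagnostic for a worker-to-self L1-to-L1 channel, else None."""
    same_worker = put.worker is not None and put.worker == get.worker
    if put.memory_space == get.memory_space == "L1" and same_worker:
        return (
            f"channel '{name}': put and get are both in L1 of worker "
            f"{put.worker}; worker-to-self L1-to-L1 transfers are not supported"
        )
    # Anything else (including worker-to-worker L1-to-L1) passes this check.
    return None
```

For instance, `verify_channel("ChanSelf", Endpoint("L1", (0, 0)), Endpoint("L1", (0, 0)))` returns a message naming the channel and worker, while an L2-to-L1 channel returns `None`.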

@hunhoffe
Collaborator Author

This issue directly addresses accessing L2 allocations in a channel: #666

This PR applies that fix and attempts to repair the worker-to-self example using the new capability; however, it still fails with:

python3: mlir-air/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:956: void xilinx::air::simpleDMAChannelAllocation(std::vector<MemcpyBundleAsFlow> &, ShimDMAAllocator &, MemTileDMAAllocator &, TileDMAAllocator &): Assertion `core' failed.
Aborted (core dumped)
